UNIFIED PROBABILISTIC FRAMEWORK FOR PREDICTING AND
DETECTING SEIZURE ONSETS IN THE BRAIN AND MULTITHERAPEUTIC
DEVICE



CROSS-REFERENCE TO RELATED APPLICATIONS 

This application is related to co-pending patent application "Adaptive Method and
Apparatus for Forecasting and Controlling Neurological Disturbances Under a Multi-Level
Control", serial number (attorney docket 1080 1020), having the same inventorship.
The present application is also related to international application WO 00/10455,
published under the Patent Cooperation Treaty (PCT) on March 2, 2000. The related
patent applications are hereby incorporated by reference into this description as fully as if
here represented in full.

BACKGROUND OF THE INVENTION 

The present invention is in the field of medical devices to treat neurological
disorders of the brain. More specifically, the invention is directed to a method and a
partially or fully implanted apparatus for predicting and detecting epileptic seizure onsets
within a unified multiresolution probabilistic framework, thereby enabling a portion of
the device to automatically deliver a progression of multiple therapies, ranging from
benign to aggressive as the probabilities of seizure warrant, in order to deter the course of
a seizure with only the minimally required intervention and associated side effects.

Second to stroke, epilepsy is the most common neurological disorder of the brain.
It is characterized by recurrent seizures that significantly impair the quality of life of an
estimated 1 to 2% of the world's population. Drugs are the most common form of
treatment, but their efficacy is limited. Up to 30% of patients achieve no control of their
seizures with drugs, and another 30% experience severe side effects that make it
impossible to lead normal lives.

A personal device capable of warning and/or intervening therapeutically in
response to imminent seizures would allow those with epilepsy to, at a minimum, remove
themselves from danger (e.g., stop driving a car), and in the best case become
seizure-free, not even noticing times when they were about to have a seizure. Such a device
would operate in a continuous-time closed control loop where therapy is responsive to
measurements (this includes a patient's own actions in the loop).

WO 02/058536
PCT/US01/50046

Several prior art closed-loop responsive systems with applicability to improving
the quality of life of epileptic individuals are known to have been proposed in the field to
date. All prior art systems share the following disadvantages: (1) they only detect
visually obvious changes in raw signals, thus control of seizures is attempted only after
individuals actually begin having each seizure; (2) they implement a deterministic
approach which is inadequate in face of the uncertainty and complexity of the problem;
(3) they offer no means of gauging confidence in the outputs; (4) they implicitly assume a
single (infinite) time resolution which may be adequate for seizure detection but not
prediction; (5) they suggest a control scheme which is closed-loop only at the triggering
instant dictated by detection (treatment beyond that point is open-loop, and is called
triggered open-loop control in the present invention); (6) they do not deliver therapy that
is graded from benign to aggressive as the situation warrants; (7) they do not consider
side effects; (8) they imply detection schemes that are not guided by optimality criteria;
(9) they rely on a single input feature or multiple features of the same nature (e.g., power
in frequency bands) or only few uncorrelated features; (10) they use the same features for
the whole patient population and do not take advantage of patient-specific features; (11)
they do not transfer adequate samples of data for offline analysis; (12) they possess little
or no computational intelligence with no learning capabilities to automatically improve
and maintain performance over time; (13) they directly threshold separate single features
instead of an implicit likelihood ratio function of joint features, thereby yielding
suboptimal decision rules; and (14) they do not account for the fact that training and/or
testing seizure detectors/predictors with wrong prior probabilities of seizures/preseizures
(as reflected in raw data archives or clinical trials) induces a variety of distortions that
must be corrected.

The present invention is directed to overcome the disadvantages and limitations of
all prior art.

SUMMARY OF THE INVENTION 

The invention is directed to a method and a partially or fully implanted apparatus
for predicting and detecting epileptic seizure onsets within a unified multiresolution
probabilistic framework, thereby enabling a portion of the device to automatically deliver
a progression of multiple therapies, ranging from benign to aggressive as the probabilities
of seizure warrant, in order to prevent, abort, or mitigate the intensity, duration,
frequency, and spread of seizures with only the minimally required intervention and
associated side effects. Based on novel computational intelligence algorithms, a realistic
posterior probability function P(S_T|x) representing the probability of one or more seizures
starting within the next T minutes, given observations x derived from intracranial EEG
(IEEG) or other signals, is periodically synthesized for a plurality of prediction time
horizons (scales T, or resolutions 1/T), e.g., a second, a minute, ten minutes, an hour, etc.
When coupled with optimally determined thresholds for alarm or therapy activation,
probabilities defined in this fashion provide anticipatory time-localization of events in a
synergistic logarithmic-like array of time resolutions, thus effectively circumventing the
performance vs. prediction-horizon tradeoff of single-resolution systems. For example,
whereas it is unrealistic to predict the exact onset time of a seizure as 9 minutes and 58
seconds from now, it is both realistic and useful to predict that the onset will occur
anytime within the next 10 minutes, a time during which the seizure can be prevented
using a benign form of treatment. The longer and shorter prediction time scales are made
to correspond to benign and aggressive therapies respectively. In addition to providing
degrees of confidence and fine monitoring of patients' states, probabilities can be
advantageously treated as degrees of "imminence" of events. Such degrees in turn serve
to modulate the dosage and other parameters of treatment during open-loop or feedback
control of preseizures once activation is triggered. Fast seizure onset detection is unified
within the framework as a degenerate form of prediction at the shortest, or even negative,
time horizon. The device is required to learn in order to find the probabilistic prediction
and control strategies that will increase the patient's quality of life over time. A
quality-of-life index (QOLI) is used as an overall guide in the optimization of patient-specific
signal features, the multitherapy activation decision logic, and to document if patients are
actually improving.

A distinguishing theme of the present invention is that prediction is achieved for
most patients and circumstances well before electrographic onset of seizures, and before
any changes in raw physiologic signals that are visually obvious to a human expert. These
prediction windows afford sufficient time to discourage seizures starting with mild forms
of treatment, and escalating into multitherapeutic regimes only as it becomes necessary.
Therefore, a principal objective of the invention is to avert seizures in the brain using
only the minimally required interventions and their attendant side effects.

The present invention exploits the synergy of multiple signal features of a
different nature. Features are accessed from a rich feature library including instantaneous,
historical, spatial, and artificial features. Patient-specific signal features are exploited.
Conventional features are custom-searched, and artificial features are custom-made, for
each patient and prediction horizon, optimizing prediction performance and
computational requirements. The invention exploits the synergy of multiple time
resolutions in parallel.

The invention displays probabilities of oncoming seizures, each associated with a
prediction horizon/resolution, in order to indicate both the time frame when onsets are
expected to occur, and degrees of confidence in the predictions.

The value of the probabilities can be deliberately influenced by using them as
controlled variables in a hierarchical seizure controller consisting of multitherapy
activation decision logic and triggered open-loop or feedback control laws/actuators.

Multitherapy activation decisions can be based on user-selectable classifier-based
optimality criteria (e.g., minimum error, minimum error risk, minimum overall risk,
minimum false positives subject to constant false negatives, etc.), all in turn directed to
maximize QOLI. The invention unifies seizure onset detection as a degenerate form of
prediction at the finest time resolutions.

Because therapies can change the very patterns that the device is designed to
initially recognize, a seizure predictor-controller (or seizure onset detector-controller)
must have learning capabilities, otherwise it is only a matter of days before it becomes
ineffective. It is therefore a further principal objective of the invention to teach novel
computational intelligence learning algorithms required for a device to improve and
maintain its performance over time. Such methods include the ability to correct for
mismatches between the prior probability of preseizures/seizures that is incorrectly
inferred from training data, and the patient's real-life probabilities of those events.
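
By way of non-limiting illustration, one textbook way to correct a two-class posterior
for a prior mismatch is to rescale each class by the ratio of its real-life prior to its
training prior and then renormalize. The sketch below assumes exactly that rule; it is not
necessarily the learning algorithm taught in this description, and the function and
parameter names (correct_for_prior_mismatch, p_train, prior_train, prior_true) are
hypothetical.

```python
def correct_for_prior_mismatch(p_train, prior_train, prior_true):
    """Rescale a posterior estimated under prior_train to the prior prior_true.

    p_train     : posterior probability of the preseizure class produced by an
                  estimator trained on data whose preseizure fraction is prior_train
    prior_train : preseizure prior implied by the training data (assumed known)
    prior_true  : preseizure prior in the patient's real life (assumed known)
    """
    if not (0.0 < prior_train < 1.0 and 0.0 < prior_true < 1.0):
        raise ValueError("priors must lie strictly between 0 and 1")
    # Reweight each class by (real-life prior / training prior), then renormalize.
    pre = p_train * prior_true / prior_train
    non = (1.0 - p_train) * (1.0 - prior_true) / (1.0 - prior_train)
    return pre / (pre + non)


if __name__ == "__main__":
    # A balanced training archive (50% preseizure) versus a rare real-life event (1%).
    print(correct_for_prior_mismatch(0.5, 0.5, 0.01))
```

When the two priors agree the correction leaves the estimate unchanged, which is a quick
sanity check on the rule.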

The above and other novel features, objects, and advantages of the invention will
be understood by any person skilled in the art when reference is made to the following
description of the preferred embodiments, taken in conjunction with the accompanying
drawings.

BRIEF DESCRIPTION OF THE DRAWINGS 

Fig. 1 illustrates a functional block diagram of the system in online operation
mode.

Fig. 2 illustrates an exemplary multitherapy activation decision logic.

Fig. 3 illustrates a functional block diagram of the system in learning and
programming mode.

Fig. 4 illustrates a processing logic for the learning algorithm required for the
system to improve and maintain performance over time.

DETAILED DESCRIPTION OF THE INVENTION

The present invention is directed to a method and an apparatus for predicting and
detecting the onset of seizure disorders within a unified, multiresolution probabilistic
framework that enables a portion of the apparatus to deliver automatically a progression
of multiple therapies. The therapies range from benign to aggressive as the probabilities
of seizure warrant, in order to prevent, abort or mitigate the intensity, duration, frequency
and spread of seizures.

Fig. 1 illustrates an exemplary architecture of a seizure prediction and control
system 10 in an online operation mode, according to the present invention. A signal
acquisition portion 20 conditions (amplifies, filters, isolates, multiplexes, etc.), and
digitizes multiple raw signals, preferably intracranial EEG (IEEG). The acquired
multichannel data are preprocessed 30 to attenuate artifacts such as any residual 60 Hz
contamination, bad transducer contact, etc. Other steps that may be associated with
preprocessing in some cases, but not universally, are considered to be part of the feature
extraction algorithms. Examples of these include: bipolar derivation of two referential
signals, and recurrent standardization of signal amplitudes based on exponentially
weighted moving averages (EWMA) of the mean and standard deviation of those
amplitudes. The buffer in this portion implements a sliding observation window of
present and past data used by the feature extraction algorithms, typically spanning
periods of 1 to 10 times the prediction horizon T. Data subsampling allows the buffer size
to remain fixed.
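
The recurrent standardization mentioned above can be sketched as follows. This is a
minimal illustration only, assuming a single channel, a fixed forgetting factor alpha, and
the usual incremental update for an exponentially weighted mean and variance; the name
ewma_standardize and the default values are hypothetical and are not specified in this
description.

```python
import math


def ewma_standardize(samples, alpha=0.01, eps=1e-12):
    """Standardize each sample with running EWMA estimates of mean and std.

    alpha is an assumed forgetting factor (larger values forget faster); eps
    guards against division by a vanishing standard deviation.
    """
    if not samples:
        return []
    mean = float(samples[0])  # initialize the running mean at the first sample
    var = 0.0                 # running exponentially weighted variance
    out = []
    for x in samples:
        diff = x - mean
        incr = alpha * diff
        mean += incr                            # EWMA of the mean
        var = (1.0 - alpha) * (var + diff * incr)  # EWMA of the variance
        std = math.sqrt(var)
        out.append((x - mean) / std if std > eps else 0.0)
    return out


if __name__ == "__main__":
    # A sudden amplitude jump stands out strongly against a quiet baseline.
    print(ewma_standardize([0.0] * 5 + [10.0])[-1])
```

Because the estimates are updated one sample at a time, no history beyond the running
mean and variance needs to be stored, which suits a fixed-size buffer.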

The feature extraction portion 40 contains selectable algorithms and mathematical
formulas that distill relevant seizure-predictive and seizure-indicative attributes from the
high-dimensional preprocessed signals. Multichannel data within the sliding observation
window (typically 400 to hundreds of thousands of numbers) are converted into a low
dimensional vector [x_1 x_2 ... x_n] containing 10 or fewer features for each time scale
(usually 40 or fewer numbers in the complete vector). Features can be any linear or nonlinear
projection of raw data into another space, including the same raw data without
compression as a special case. Features can be updated as fast as the raw data sampling
rate (e.g., 200 Hz), but typically the fastest feature sampling is set as one every 0.45
seconds. Longer-horizon features can be updated less frequently according to a
no-less-than-50% sliding window overlap criterion. Distinctly from all the prior art
contemplating features, the feature vector is made patient-specific and optimized from a
rich feature library of instantaneous, historical, spatial, and artificial features. An
exemplary feature library is disclosed in co-pending Application Serial No. (Attorney
Docket 1080 1020). Examples of useful features include accumulated energy profiles,
quick fractal dimension (curvelength), absolute values of wavelet coefficients, nonlinear
energy, spectral entropy, prodrome templates, the statistical moments of any feature, and
custom-made, genetically found, neurally computed features.
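
Two of the named features can be sketched in a few lines. The definitions below are
assumptions, since this description does not give formulas: curvelength is taken as the
sum of absolute differences between consecutive samples, and nonlinear energy as the
window average of x[k]^2 - x[k-1]*x[k+1] (the Teager-Kaiser operator). The function names
are hypothetical.

```python
def curve_length(window):
    """Curvelength of a window: sum of absolute first differences (assumed definition)."""
    return sum(abs(b - a) for a, b in zip(window, window[1:]))


def nonlinear_energy(window):
    """Mean of x[k]^2 - x[k-1]*x[k+1] over interior samples (assumed definition)."""
    values = [
        window[k] ** 2 - window[k - 1] * window[k + 1]
        for k in range(1, len(window) - 1)
    ]
    return sum(values) / len(values) if values else 0.0


if __name__ == "__main__":
    window = [0, 1, 0, -1, 0]  # one cycle of a coarse oscillation
    print(curve_length(window), nonlinear_energy(window))
```

Both features reduce an entire observation window to a single number, which is the kind of
compression from thousands of samples to a handful of features described above.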

The complete feature vector x is demultiplexed into constituent subvectors each
corresponding to a particular time scale: x = [x^(60) x^(10) x^(1) x^(-1/30)] (for brevity, time
scale superscripts will be dropped hereinafter). Equivalently, the feature extraction
portion 40 can be implemented as subsystems that individually compute feature vectors
from different observation windows. Each probability estimator P(S_T|x), implemented by
a wavelet neural network or any other universal approximator, draws from its recent
collective knowledge regarding the behavior of its feature vector x in order to compute
the probability that one or more seizures will begin at any time within the next T minutes.
In Fig. 1, T is shown for 60, 10, 1, and -1/30 minutes. The latter defines a seizure onset
detector within the first 2 seconds of electrographic seizure. The horizon T should not
extend beyond about one fourth (1/4) of the average inter-seizure period (e.g., if a patient
is known to have one seizure a day, then it is not informative to predict a seizure within
the next day, or week, or year, etc., but it becomes increasingly useful within the next 6
hours and less). Instead of a single-resolution system attempting to determine the exact
time remaining to seizure after a precursor occurs, the synergistic logarithmic-like array
of time resolutions is used to cover a wide range of time frames when precursors may
occur. This, in conjunction with the way in which the S_T event is defined, effectively
circumvents the performance vs. prediction-horizon tradeoff of single-resolution systems.
The posterior probability provides information beyond the prior probability of
preseizures, P(S_T), to the extent that the feature vector x behaves differently under
preseizure and nonpreseizure conditions. For example, if the statistical behavior of x is
identical under the two classes (including seizures and nonseizures as degenerate special
cases), then P(S_T|x) = P(S_T), intelligent time-localization of events would not be possible
using those features, and the device would operate in sensorless, purely open-loop mode.
It should be noted that equal distributions do not imply that x is not useful; a dynamic
feature of x may well be separable, but that is a new and different x. It is well known that
EEG features are vastly different under seizure and nonseizure conditions, to the extreme
of visual obviousness in the raw signals, so that automatic seizure detection with very
high reliability is straightforward. Research to date has also shown that there exist
premonitory features, most often in raw signals invisible to the naked eye, that enable
preseizure detection with degrees of certainty ranging from moderate to very high. An
important additional consideration for a feature to be discriminatory is that its variances
under each class be relatively small. If a feature varies wildly at every feature sampling
time, so will the probability P(S_T|x). This is simply a reflection of the fact that the feature
is an inconsistent correlate of the classes. Integrating a series of "preprobabilities," as in
sequential hypothesis testing, and taking that as a new input feature, or calculating
moving averages or standard deviation charts of features, are all examples of making
features more historical in nature so as to stabilize their behavior and thus their utility to
the probability function.
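
The relation between the posterior and the prior discussed above follows directly from
Bayes' rule, and can be checked numerically. The sketch below is illustrative only: it
assumes the class-conditional likelihoods of an observed feature value are already known,
whereas in this description the posterior is learned directly by a universal approximator.
The function name posterior_preseizure and the numeric values are hypothetical.

```python
def posterior_preseizure(prior, lik_pre, lik_non):
    """Bayes' rule for two classes.

    prior   : prior probability of the preseizure class, P(S_T)
    lik_pre : likelihood of the observed feature value under the preseizure class
    lik_non : likelihood of the same value under the nonpreseizure class
    """
    numerator = lik_pre * prior
    return numerator / (numerator + lik_non * (1.0 - prior))


if __name__ == "__main__":
    prior = 0.05
    # Identical class-conditional behavior: the observation carries no information.
    print(posterior_preseizure(prior, 0.3, 0.3))
    # Distinct behavior: the posterior moves away from the prior.
    print(posterior_preseizure(prior, 0.9, 0.1))
```

The first call returns the prior itself (up to rounding), which is the sensorless case
described above; the second shows how a discriminatory feature raises the posterior well
above a small prior.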

The probability estimation portion 50 outputs a probability vector
P = [P_60 P_10 P_1 P_{-1/30}], which is then taken as input to a hierarchical
multiple-input-multiple-output seizure controller 70. The multitherapy activation decision
logic block 60 determines which therapy modalities (control laws and therapy actuators 70)
are activated or deactivated at any given time. Therapies can include cognitive stimulation
72 (with speech synthesizer that reads out an arithmetic or other mental task), sensory
stimulation 74 (audio, visual, tactile, olfactory, kinesthetic, or other), biofeedback,
electrical stimulation 78, pharmacological infusion 76, or other. For each therapy modality
that is activated, a control law u = g(P) defines exactly how that therapy actuates on the
brain. In triggered open-loop mode, fixed parameters such as drug dosage, or amplitude,
frequency, pulse width, phase, etc., of preprogrammed electrical stimulation waveforms
are obtained by considering the probability vector only at the time of activation t_a:
u = g(P(x(t_a))). More generally, in feedback control mode 70, the control action is a function
of the continuously varying probability vector: u(t) = g(P(x(t))). In this case,
proportional, proportional-integral-derivative (PID), optimal continuous, gain-scheduled,
multilevel, bang-bang, or other control strategies are envisioned in order to regulate
P(x(t)), as a controlled variable, back to the zero vector [0 0 0 0]. That is, intervention is
automatically and continuously modified so as to steer neural activity away from conditions
known to be consistent with the future occurrence of seizures. Feedback control 70 offers
the most finely graded form of treatment and potential for absolutely minimal
intervention and the attendant side effects; however, obtaining stable closed-loop controls
requires more extensive experimentation during presurgical evaluation than the triggered
open-loop alternative. The device can also be programmed to work in purely open-loop
mode (delivering prophylactic treatment at preset on and off times), and can be manually
operated by the patient via a button, magnet, vibration transducer, or any other switching
technique.
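
The difference between triggered open-loop and feedback control can be sketched with the
simplest of the strategies listed above, a proportional law with saturation. Everything
specific in the sketch is an assumption made for illustration: the scalar control output,
the gain values, the saturation limit u_max, and the function names are hypothetical and
are not prescribed by this description.

```python
def proportional_law(P, gains, u_max):
    """Assumed form of u = g(P): weighted sum of probabilities, clipped to [0, u_max]."""
    u = sum(k * p for k, p in zip(gains, P))
    return max(0.0, min(u_max, u))


def triggered_open_loop(P_at_activation, gains, u_max, n_steps):
    """Parameters are fixed from P at the activation instant and then held."""
    u = proportional_law(P_at_activation, gains, u_max)
    return [u] * n_steps


def feedback(P_trajectory, gains, u_max):
    """The control action follows the continuously varying probability vector."""
    return [proportional_law(P, gains, u_max) for P in P_trajectory]


if __name__ == "__main__":
    gains = [0.5, 1.0, 2.0, 4.0]  # hypothetical gains for horizons 60, 10, 1, -1/30 min
    print(triggered_open_loop([0.65, 0.1, 0.2, 0.0], gains, 1.0, 3))
    print(feedback([[0.65, 0.1, 0.2, 0.0], [0.2, 0.05, 0.0, 0.0], [0, 0, 0, 0]], gains, 1.0))
```

In the feedback case the output returns to zero once P has been regulated back to the zero
vector, whereas the triggered open-loop output stays at the value computed at activation.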

The device transmits its internally buffered data and other variables around the
times of all therapy activations, as well as at a random or a preprogrammed sampling of
times, including all times, by well known radiotelemetry. The probability vector P and
multitherapy activation status are displayed in the patient's wearable access unit 80. The
steps of preprocessing 30, feature extraction 40, probability estimation 50, and
multitherapy decision logic 60 can be implemented in a microprocessor, a digital signal
processor (DSP), a field programmable gate array (FPGA), an application specific
integrated circuit (ASIC), hybrid analog/digital circuitry, or combinations thereof
following design principles well known in the microelectronics industry.

This intelligence structure is coupled to an array of interventions based upon
electrical stimulation, chemical infusion and synthesis of artificial neuronal signals to
counteract developing seizures as precursors build over time. The intensity of
intervention, modality of therapy and spatial distribution of therapy are all adjusted as the
probability of seizures increases over time. A guiding principle of these interventions is
that the most benign forms of therapy are initiated relatively early in seizure generation
and over a relatively small region of the brain, so as to cause little or minimal disruption
of normal activity when the probability of seizure onset is relatively low. This will allow
intervention to be triggered by prediction thresholds with high sensitivity (e.g., very low
false negative rate) at the cost of a relatively low specificity (e.g., relatively high false
positive rate). As the probability of seizures increases, therapeutic stimuli are increased
in intensity, duration, frequency of delivery, and are delivered over a wider area of the
brain. Since patterns of seizure precursors and their spread in space and time leading up
to seizures are mapped and used to train the device on each individual patient, therapy is
delivered over broader areas, just ahead of the anticipated region of spread, as seizure
precursors develop, if they do not respond to earlier treatment. In this scheme, therapy
can be delivered locally, in the region of onset, in a distribution surrounding the region of
onset, isolating it from recruiting adjacent regions of the brain and spreading. Therapy
can also be delivered locally and/or remotely in subcortical regions such as the thalamus,
basal ganglia, or other deep nuclei and regions, escalating in intensity, type of stimulus
and distribution of action, as seizures progress. This same principle is applied to
therapeutic intervention if electrical seizure onset takes place, effecting treatment in the
general region of onset, in deep brain structures which modulate the behavior of the
seizure focus, or both simultaneously.

Interventions can include the following: (1) rhythmic electrical pacing, which
changes in frequency, intensity and distribution as the probability of seizure onset reaches
a threshold and increases; (2) chaos control pacing; (3) random electrical stimulation to
interfere with developing coherence in activity in the region of and surrounding the
epileptic focus; and (4) depolarization or hyperpolarization stimuli to silence or suppress
activity in actively discharging regions or regions at risk for seizure spread. This activity
can also be delivered to numerous electrode sites to create a type of "surround inhibition"
to prevent progression of seizure precursors. These stimuli can also be delivered
sequentially in a "wave" that sweeps over a region of tissue, so as to progressively inhibit
normal or pathological neuronal function in a given region(s) or tissue, including cortical
and subcortical regions.

The principle of altering and developing therapy in response to the changing
probability of seizure, and/or the detection of specific events in seizure evolution,
including electrical seizure onset and spread, is also applied to the delivery of chemical
therapy. In this fashion, active therapeutic agents are infused or otherwise released in the
brain regions where seizures are generated, or to where seizures may spread. As seizures
become more likely, the amount, concentration or spatial distribution through which a
chemical agent is delivered are all increased. As with electrical or other therapeutic
interventions, patterns of delivery can include infusing a drug directly in the epileptic
focus, in an area surrounding it, or to regions involved in early spread, or to more central
or deep brain regions, which may modulate seizure propagation. These same therapeutic
principles apply to distribution of maximal therapy when electrical seizure onset is
detected, including distributing therapy to regions where seizures are known to spread
and propagate. Last-minute treatment may include release of larger amounts of drug into
the cerebrospinal fluid (CSF) space for circulation over wide regions of the brain or into
the cerebral circulation. Other types of pharmacological agents may also be used in this
scheme, such as agents which are activated by oxidative stress, which may themselves
increase the concentration and distribution of an active therapeutic agent as seizure
precursors evolve and the probability of seizures increases.

Therapy may also include delivery of stimuli, electrical, chemical or other, to
peripheral or central nerves or blood vessels, in a graded fashion, as the probability of
seizures increases, building up to therapy of maximal intensity at the detection of
electrical seizure onset. Therapy may also include sensory stimulation (touch,
temperature, visual, auditory, etc.).

Finally, therapy may consist of synthesized, artificial neuronal signals delivered in
such a way as to disrupt electrochemical traffic on the appropriate neuronal networks
including or communicating with the ictal onset zone. Examples of such interventions
might include transmission of synthesized signals which increase the output of specific
cell populations, such as inhibitory interneurons, specific nuclear regions in the thalamic
or other deep structures.

Using any or all of these methods singly, or in combination, therapy is directed
toward preventing seizure onset, or isolating the development of seizures and their
propagation so as to prevent or minimize clinical symptoms and the impact of these
events.

Fig. 2 shows an example of a simple, yet very effective decision logic for
multitherapy activation. The multivariable relation between probabilities and therapies is
decoupled by tying each therapy mode to a particular prediction horizon: benign
monotherapy to a large scale (coarse resolution), and aggressive monotherapy to a small
scale (fine resolution). This is based on the principle, borne out of research, that the
earlier the course of a seizure is interrupted, the more likely it is for a mild form of
treatment to be successful. Therefore, as in the exemplary decision logic, benign
treatment is encouraged first in order to reach the point of minimally invasive but
sufficient therapy. Each subsystem will be independently activated based upon its own
P(S_T|x), sometimes simultaneously if such control force becomes necessary. As will be
further disclosed hereinafter, this scheme makes it possible to learn each P(S_T|x) using
simple classifier-based optimality criteria. It also allows the seizure controller in the
device to remain off most of the time, consuming only the minimum power necessary to
avert seizures.
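
The decoupled decision logic described above can be sketched as one independent threshold
test per prediction horizon. The pairing of horizons with therapies below follows the
worked example given in this description (audio stimulation, a spoken mental task,
electrical stimulation, drug release); the threshold values and all names are hypothetical
and chosen only for illustration.

```python
# Horizons in minutes, ordered from coarse to fine resolution.
HORIZONS = ("60", "10", "1", "-1/30")

# One therapy per horizon: benign at large scales, aggressive at small scales.
THERAPY = {
    "60": "audio stimulation",
    "10": "cognitive task",
    "1": "electrical stimulation",
    "-1/30": "drug infusion",
}


def active_therapies(P, thresholds):
    """Return the therapies whose own probability meets its own threshold.

    Each horizon is tested independently, so several therapies may be active
    at once and the controller stays off when every probability is low.
    """
    return [
        THERAPY[h]
        for h, p, th in zip(HORIZONS, P, thresholds)
        if p >= th
    ]


if __name__ == "__main__":
    thresholds = (0.6, 0.5, 0.5, 0.9)  # hypothetical activation thresholds
    print(active_therapies([0.65, 0.1, 0.2, 0.0], thresholds))
    print(active_therapies([0.7, 0.6, 0.2, 0.0], thresholds))
```

Because each test involves only one probability, each estimator can be trained and tuned as
a separate two-class problem, which is the simplification the decoupling is meant to buy.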

The following simplified example serves to illustrate a worst-case scenario of the
inventive system in online triggered open-loop mode as a seizure approaches, but is not
intended to preclude other ways in which the system could have behaved or could have
been practiced. The probability vector for the next month in a patient's device is
determined to be P = [P(S_60|x_1) P(S_10|[x_2 x_3]) P(S_1|x_4) P(S_{-1/30}|x_5)], where
x_1 = 60-minute resetting energy buildup at a focal channel, x_2 = exponentially weighted
moving average (EWMA) of signal power 98% forgotten after 10 minutes renormalized by last
hour reference, x_3 = count of 1-minute standard deviation excursions above a threshold
over the last 10 minutes, x_4 = 1-minute moving average of nonlinear energy, and x_5 =
second signal curvelength. Two hours before (i.e., time -120 min.) the electrographic
onset (time zero) of a particular seizure, the P values are hovering close to zero. Since the
prior probabilities of preseizure-class features are very small for the chosen prediction
time scales, the posterior P values will also tend to remain close to zero unless really
unique feature changes are observed. At time -40 min., P = [0.65 0.1 0.2 10^^]. Note
that the P values are not necessarily monotonically related (e.g., P_1 "more likely" than
P_10) because the estimators are different mathematical functions trained to examine
different features over the different time scales. In fact, if such monotonicity always held
true, then the synergistic multiresolution advantage of the invention would be lost. The
value P_60 already reached a threshold that triggers an audio waveform with
preprogrammed characteristics. Note that a 60-minute prediction horizon does not imply
that activation will happen exactly at t = -60 min., though earlier activation does imply
an error (P_60's job is to time-localize within a specific 60-minute time frame, not predict
at "random" times). It is 3:00 am and the patient is awakened by the device, but through
the wearable access unit 80, he or she may choose to block out this type of treatment
during programmed times of the day, or a sleep state detector may do so automatically.
Within a minute, P_60 falls below its activation threshold and stimulation shuts down;
however, at time -12 min. it goes back up, and at -8 min., P_10 also exceeds its threshold
and a speech synthesizer temporarily disables audio stimulation while it asks the patient
to carry out a mental calculation. The patient does so while simultaneous audio
stimulation takes place. Despite these efforts to discourage seizure-inducing synchronized
neural activity, P_1 indicates at time -30 sec. that the seizure is imminent, and electrical
stimulation directly to the brain or to nerves under the skin is initiated. At one second
past electrographic onset, still before any clinical symptoms, P_{-1/30} reaches 0.95 and a
drug is released directly to the brain or through the spinal cord. If the probability
estimators were not trained including time segments of seizures under treatment, then
they are temporarily disabled at this point (but not their associated therapies). Within a
few more seconds, all traces of seizure have subsided and the seizure controller 70 is shut
down. The skilled practitioner will recognize that progressive multitherapy as in the
present invention is gentle enough to exhaust the possibilities of milder and safer forms
of intervention first, but tough enough to ensure that a seizure will never go past
treatment into its full expression.

In a close variation of the treatment decision logic of Fig. 2, two or more
duplicate time scales can be used to activate the same therapy modality but differing only
in treatment parameters, based on mutually exclusive intervals of $P$ (instead of a
threshold). For example, a small dosage may be used if $0.7 \le P < 0.8$, a moderate one if
$0.8 \le P < 0.9$, and a strong one if $P \ge 0.9$. The actual values of the thresholds can be
much smaller for small time scales, as will become apparent from subsequent
description of the methods.
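
The interval-based variation just described can be pictured as a simple lookup from probability to dosage tier. The following is a minimal sketch only, using the example intervals quoted in the text; the function name `dosage_level` and the tier labels are hypothetical and are not part of the disclosed device.

```python
def dosage_level(p: float) -> str:
    """Map a seizure probability p to a dosage tier.

    Hypothetical helper: the intervals are the example values from the text
    (0.7 <= P < 0.8, 0.8 <= P < 0.9, P >= 0.9); real thresholds would be
    learned per patient and per time scale.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if p >= 0.9:
        return "strong"
    if p >= 0.8:
        return "moderate"
    if p >= 0.7:
        return "small"
    return "none"  # below every interval: this therapy is not activated
```

Because the intervals are mutually exclusive, exactly one tier is selected for any value of the probability, which is what distinguishes this variation from a single activation threshold.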

A system aimed to reduce or eliminate seizures, such as the present invention,
must have interventive capabilities, otherwise it would only detect or predict seizures
silently and without affecting the brain. If there is intervention, then the observable
behavior of seizures changes over time (consider for example decreased seizure
frequency, effects of kindling, or the difference between EEG data with and without drug
tapering). If the patterns of seizures change over time, then the device must have learning
capabilities, otherwise it will be only a matter of days before the system becomes
ineffective. Therefore, advantageously over all prior art, the present invention conceives
and discloses computational intelligence learning algorithms necessary to improve and
maintain the device effectiveness over time.

Fig. 3 is an exemplary functional block diagram of the seizure predictor-controller
system in learning and programming mode. During preoperative evaluation and
periodically thereafter, typically once every month, an authorized caregiver is able to
access the past period's data from the external storage device 90, train the feature
extraction 40, probability estimation 50, and multitherapy activation decision logic 60
portions of the patient's device for optimum performance over the next period, download
the learned algorithms and parameters into the device's electrically erasable
programmable read-only memory (EEPROM) via transceivers 95, program other device
settings, and telemeter the device settings and operating variables at any time, including
during online mode, all through a centralized learning and programming software
interface. The laptop or workstation computer 100 where this software runs can faithfully
simulate the actions of the feature extraction 40, probability estimation 50, and
multitherapy activation decision logic 60 portions of the device, and more importantly,
can learn how to alter those portions in order to bring about performance improvement
and maintenance. The ability of the system to carry out this function externally is crucial,
since the required learning algorithms are CPU-time and memory expensive. "Learning"
within the implanted device itself under severe clock speed, memory, and power
constraints necessarily involves depriving the device of most of its potential intelligence.

The learning and programming functions can also be carried out regardless of the
patient's or caregiver's physical locations. A connection is established through the public
switched telephone network, cable network, or digital subscriber line, and a service
provider, or through local area network, collectively 110, then via a T1 line or other high
speed digital link to the Internet 120, all the way to the remote caregiver's portable
computer 130 or to a remote facility 140 where the same software interface runs.

The first training of the system can be based on IEEG data archived during
routine preoperative evaluation, usually over a period of three to fourteen days. An
optional bedside version of the device can be used during this period prior to transferring
the initial algorithms and parameters into the partially or fully implanted device. The
bedside unit can be built by implementing signal acquisition 20, preprocessing 30, feature
extraction 40, probability estimation 50, and activation decision logic 60 within the
laptop computer or workstation 100 that runs the programming and learning software
interface, and a set of external therapy actuators. In some cases where a target location of
electrodes is not well defined, especially in patients with generalized seizures, the
archival during presurgical evaluation step can be obviated and the device can be
implanted untrained to serve as the data collector itself.

Learning in the present inventive system results from a long-term iterative
optimization procedure, starting during presurgical evaluation and then periodically
thereafter, typically once every month. Patterns of seizure evolution are tracked spanning
all treatment periods, including tapered off medication, no treatment, and multitherapy.
Fig. 4 illustrates the processing logic of the learning algorithms involved (dotted lines
indicate optional steps). The following eight steps and substeps are carried out for each
patient and each time scale $T$ except where reference is made to QOLI and unequivocal
electrographic onsets (UEOs) (each patient has only one QOLI and one set of UEOs), and
to the feature library (there is only one library defined independently of patients and time
scales, although it may also be improved from time to time).

Step 1: A quality-of-life index (QOLI) that penalizes the intensity, duration, and
frequency of both seizures and treatment interventions (all disturbances to
the patient) over time is charted to serve as an overall guide for
adjustments and decision making and to document progress (logic block
402).
Step 2: An expert neurologist inspects the IEEG that was digitally archived in the
external storage device over the last period, and marks the times of
unequivocal electrographic onset (UEO) in all recorded seizures (logic
block 404). Optionally, times of asleep-awake states and treatment
information are also logged.
Step 3: Based on the UEOs, learning sets of data are created by clipping all the
$T$-minute IEEG epochs immediately preceding seizures and labeling them as
"$T$-minute preseizures" (class $S_T$). Actually, more than exactly $T$ minutes
of raw data are clipped, to allow for the initialization period required by
historical features. Similarly, randomly chosen, nonoverlapping examples
(preferably equal in number to the number of preseizures, or more) of
"$T$-minute nonpreseizures" (class $NS_T$) are clipped and labeled. The latter are
preferably chosen to be "baselines" in the sense that they are distanced in
both directions of time at least $3T$ minutes away from the nearest seizure
(class $S$). Note that the above procedure includes examples of "seizure
onsets" (class $S_{-1/30}$), defined as the first 2 seconds after UEO, and
"nonseizures" (class $NS$, or more precisely, "nononsets" $NS_{-1/30}$ that
qualify as baselines). The class $S_T$ and class $NS_T$ data sets for the last
month are joined with the data sets from the three previous months, so that
the new month has only an incremental effect on the system. The
influence that data will have on learning corresponds to either a 4-month
rectangular moving window, or a window that tapers off the last 4 months
according to some forgetting schedule (logic block 406).
Step 4: Optionally, parameters associated with the time history of signal features
such as observation window length, displacement between window slides
(or feature sampling period), and thresholds (on amplitude, duration,
density count, etc.) are "preoptimized" using statistical measures of linear
separability such as t-scores, Fisher discriminant ratios, K-factors, etc.
(logic block 408). A short calibration epoch of class $S_T$ and another of
class $NS_T$ are used to generate a time-series of each feature in the feature
library under the hypothesized set of parameters for that feature. A
parameter set that maximizes separability is chosen by a greedy grid
search varying one parameter at a time. Examples can be found in
co-pending application serial no. (Attorney Docket 1080 1020). Direct use of
the nonlinear separability measure of Step 6 below is also possible, but
research has shown that feature parameters often only marginally affect a
classifier's ability to separate the features.
Step 5: Using the feature parameters found in Step 4, or preestablished ones,
time-series of all features in the feature library are generated from all the class
$S_T$ and class $NS_T$ raw data epochs (logic block 410).
Step 6: An optimal feature vector is searched in the power set of the feature
library to minimize the expected overall risk $R_O$ (a classifier-based
performance metric that relates to QOLI) or other selected criterion, and
real-time computational requirements (logic block 412). For a feature
library of size $N_f \le 30$ and feature dimensions $n \le 5$, this step can be
solved by exhaustive search ($\le 174{,}436$ evaluations); however, in general,
heuristic searches such as add-on-knock-out (an improved version of
forward sequential search) are required. Feature optimization involves the
steps enumerated herein. Class $S_T$ and class $NS_T$ trajectories of a
candidate feature vector are obtained by time-synchronizing the
corresponding time-series in Step 5 (logic block 420). Interpolation may
be required if the feature sampling periods are different, although they will
typically be the same within the same time scale. Nonparametric
classifiers that memorize the training data set (randomly chosen vectors
typically representing 70% of all the available feature vectors) such as
k-nearest neighbors (kNNs), probabilistic neural networks (PNNs), or
hybrids, are synthesized and used to simulate decision rules that are
equivalent to each therapy activation unit in Fig. 2 (logic block 422).
Their implementation typically does not involve actual probabilities and
thresholds, but rather discriminant functions and competitive layers. With
the option of tapered-off data influence, the kernel heights of PNN, or the
distances in kNN, are weighted with forgetting factors according to the
age of each training datum. The classifiers' discriminant functions are
corrected to obtain the desired optimal decision rule $C^*(x)$ (logic block
424). The overall risk is measured over a validation data set that is not
directly used for synthesizing the classifiers (randomly chosen vectors
typically representing the remaining 30% of all the available feature
vectors), and a score $S = 100 - R_O\%$ is computed for the candidate feature
vector/optimal classifier combination (logic block 426). If necessary, the
score is penalized to account for computational expense. The next
candidate feature vector is determined from the heuristic search criterion
(logic block 428), and logic blocks 420-428 are iterated. When finished,
the selected feature vector is optionally further processed by a
computationally intensive procedure that compresses it into a single
genetically-found, neurally computed artificial feature with equal or better
discriminatory properties (logic block 432). Optionally, rank-order curves
are charted showing score $S$ vs. $n$ for the best $n$-vectors (logic block 434).
Optionally, a measure of expected performance for the next month is
obtained by computing an average $R_O$ using $v$-fold cross-validation over
all available data, or over a single independent test set not used in any way
during learning (logic block 436). There is a bias-versus-variance tradeoff
between these two methods. The expected performance is reported on a
point basis (each value of a feature vector counts as one example),
consistent with the way in which the real-time system is trained. However,
block-basis (a whole epoch counts as one example) tests can also be
reported to complement and better comprehend the assessment.
Step 7: Given the optimal feature vector or artificial feature $x$, the probability
function $P(S_T \mid x)$ is synthesized (logic block 414). Note that there are
infinitely many classifiers that can produce the same optimal decision rule
in Step 6, but there is only one correct probability function, obtained from
$S_T$ and $NS_T$ data independently of the classifiers which define how to act
on the probabilities. A wavelet neural network (or other universal
approximator suitable for online implementation in a miniaturized device)
with a logistic sigmoid output unit is trained to compute $P(S_T \mid x)$ by
presenting {0,1} targets and minimizing the expected value of a squared
error loss function (logic block 438). This is done using training data
while also monitoring error over validation data and choosing the network
that minimizes the latter. With the option of tapered-off data influence,
training the probability estimator becomes a weighted least squares
problem, where the contribution that each squared error makes to the loss
function is weighted according to the age of the training datum. The
probability estimator's bias term is corrected for prior probability
mismatches (logic block 440).
Step 8: The optimal monotherapy activation threshold that corresponds to the
same decision rule as the nonparametric classifier of logic block 422 is
obtained from a formula and coupled to the probability function $P(S_T \mid x)$
(logic block 416).
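
The exhaustive-search bound quoted in Step 6 can be checked by counting the nonempty feature subsets of dimension at most five that can be drawn from a library of thirty features. The sketch below assumes that each candidate subset is evaluated exactly once; the function name is illustrative only.

```python
from math import comb


def exhaustive_evaluations(library_size: int, max_dim: int) -> int:
    """Count candidate feature vectors of dimension 1..max_dim drawn from a
    feature library of the given size (one evaluation per subset assumed)."""
    return sum(comb(library_size, k) for k in range(1, max_dim + 1))


# 30 + 435 + 4060 + 27405 + 142506 candidate subsets
print(exhaustive_evaluations(30, 5))  # 174436
```

The count grows combinatorially with both the library size and the allowed dimension, which is why heuristic searches are needed in the general case.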

Further details and supporting theory of the above learning algorithm processing logic,
required to understand and practice the learning aspects of the invention in its presently
preferred and best mode, are now disclosed.

Quality-of-life index. The quality-of-life index (QOLI) is a novel quantitative
indicator from 0% to 100% that accounts for not only the intensity, duration, and
frequency of seizures over a period of time, but also the severity of treatment and its
associated side effects. According to the present invention, QOLI is an exponentially
weighted moving average of the daily quality $Q$, 98% forgotten over a period of a month.
The daily quality is the complement of all seizure and treatment disturbances as
quantified by their energy buildups over the past day, that is, 100% minus a weighted sum
of two normalized energy summations,
where the first summation is energy of raw IEEG over all channels only during seizures
(since the device saves sufficient data around all activations, the neurologist in Step 2
(logic block 404) can determine seizure periods, or it can be automatically determined
with a class-$S$ detector); the second summation is energy of the control effort over all
channels during all times of activation, both normalized by the worst documented case so
that each summation does not normally exceed 100%; and $K_S + K_T = 1$ are relative
weights for the seizure and treatment terms. Alternatively, the second summation is
replaced by a convex sum of the percentages of treatment dosages with respect to their
maximum values. Other standardizations from a normative database are also possible. At
any given day $d$, the ongoing QOLI is recursively computed as

$$\mathrm{QOLI}_d = \lambda Q_d + (1 - \lambda)\,\mathrm{QOLI}_{d-1}, \quad \text{with } 0 < \lambda < 1.$$

Preferably, prior to prescribing the device, the patient's QOLI is assessed "by hand" to
determine if surgery is warranted or as a starting point of reference. $\mathrm{QOLI}_0$ is initialized
as the weighted average or the mean value of previous $Q$s, or as 50% if unknown. In
expanded form, $\mathrm{QOLI}_d$ is a series with a growing number of terms

$$\mathrm{QOLI}_d = \lambda(1-\lambda)^0 Q_d + \lambda(1-\lambda)^1 Q_{d-1} + \cdots + \lambda(1-\lambda)^{d-1} Q_1 + (1-\lambda)^d\,\mathrm{QOLI}_0,$$

which is a weighted average of present and past $Q$s with exponentially decaying
coefficients that add up to 1 as $d$ goes to infinity. To forget approximately 98% of past
data over 30 days, the weights are made to decay like 4 time constants of the natural
exponential: $(1 - \lambda)^{30} = e^{-4}$, so $(1 - \lambda) = e^{-4/30} = 0.8752$. Therefore, $\lambda = 0.1248$.
A QOLI of 0% corresponds to the worst documented case while 100% corresponds to
living seizure-free and without any intervention.
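
The QOLI recursion and the choice of forgetting factor can be illustrated numerically. The sketch below is a hedged illustration, not the disclosed implementation: the daily-quality expression is an assumed form (100 minus a convex combination of two energy terms already normalized to percent of the worst documented case), the equal default weights are arbitrary, and all function names are hypothetical.

```python
import math

# Forgetting factor from the text: (1 - lambda)^30 = e^-4, so lambda is about 0.1248
LAMBDA = 1.0 - math.exp(-4.0 / 30.0)


def daily_quality(seizure_energy_pct, treatment_energy_pct, k_s=0.5, k_t=0.5):
    """Assumed form Q = 100 - (K_S*E_S + K_T*E_T), with K_S + K_T = 1.

    Both energies are taken as already normalized to percent of the worst
    documented case; the default weights are placeholders.
    """
    if abs(k_s + k_t - 1.0) > 1e-9:
        raise ValueError("K_S and K_T must sum to 1")
    return 100.0 - (k_s * seizure_energy_pct + k_t * treatment_energy_pct)


def update_qoli(prev_qoli, q_today, lam=LAMBDA):
    """One step of QOLI_d = lam*Q_d + (1 - lam)*QOLI_{d-1}."""
    return lam * q_today + (1.0 - lam) * prev_qoli


qoli = 50.0  # QOLI_0 when no prior history is known
for _ in range(30):  # thirty seizure-free, treatment-free days
    qoli = update_qoli(qoli, daily_quality(0.0, 0.0))
```

After thirty such days the initial value of 50% retains a weight of only $e^{-4}$, about 1.8%, so the index sits within one percentage point of 100%.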

The eight-step learning procedure is conceptually an iterative solution to the 
optimization problem

$$\max_{x,\,\Omega} \int \mu(t)\,\mathrm{QOLI}(t)\,dt,$$

where $x$ varies in the space of all possible feature formulas/algorithms, and $\Omega$ varies in
the space of all possible mappings from $x$ to multitherapy activation decisions
(equivalently, the space of all possible class decision boundaries). That is, the system sets
out to find the features and the therapy activation behavior that will maximize the
patient's QOLI history weighted by some memory function $\mu(t)$ over time. Obviously, the
complexity of the brain alone precludes the possibility of directly solving such problem.
In the present invention, the optimization is concretely tackled using lower-level
classifier-based performance metrics (described below), obtaining the multitherapy
decisions automatically from the monotherapy decisions associated with each time scale.

In essence, the optimization problem is converted into multiple simpler problems, one for
each time scale:

$$\max_{x} E\{S\} \quad \text{for each } T,$$

where $E\{S\}$ is the expected value for the next month of the score $S = 100 - R_O\%$
associated with a feature vector/classifier combination. The space of feature vectors $x$ is
explored using heuristic search in the power set of a feature library, while the decision
boundaries $\Omega$ are implicitly determined from decision rules known to be optimal
(described below).

Classifier-based performance metrics. Classifier-based performance metrics are
used in the present invention during offline learning to rank the desirability of feature
vectors and for researching new features. This is done in terms of classifier outputs as
opposed to statistics of the features alone, equivalent to the way in which those features
will be transformed online by the multitherapy activation decision logic 60. A note on
semantics: detection is a special case of classification where the problem is dichotomous
with a "class of interest" (usually with low prior probability) versus a "plain noise" class.
In this work, detectors in general are referred to as classifiers, but when referring to
seizure detectors, the term is preserved if a distinction from preseizure detection is in
order. It is assumed that for every feature vector obtained at the feature-vector sampling
rate, the classifier outputs a class decision. Integrating classifier decisions or
implementing sequential hypothesis testing before arriving at a declaration (e.g., to
eliminate "flickering" of decisions after a detection) can always be reduced to an
equivalent "final" classifier by simply using the outputs of the original classifier as input
features of the equivalent one. In fact, those are particular examples of historical features.
Therefore, there is no loss of generality in the framework presented here. Classifier-based
performance metrics have a theoretical definition involving integrals, probability density
functions (pdfs), and class decision boundaries; however, their empirical counterparts are
presented, which are unbiased estimates based on counting. The classifier decisions that
are collected from an experiment are tallied and organized into a confusion matrix:

$$
\begin{array}{c|cc}
 & \hat{S}_T & \widehat{NS}_T \\ \hline
S_T & N_{CP} & N_{FN} \\
NS_T & N_{FP} & N_{CN}
\end{array}
$$

where $S_T$ and $NS_T$ are labels for true preseizure and nonpreseizure classes, while $\hat{S}_T$
and $\widehat{NS}_T$ indicate the classes declared by the classifier, $N_{CP}$ = number of correct
positives (preseizure class detections), $N_{CN}$ = number of correct negatives, $N_{FP}$ = number
of false positives (false alarms), and $N_{FN}$ = number of false negatives (preseizure class
misses). Furthermore, $N_{S_T} = N_{CP} + N_{FN}$ = number of preseizure examples;
$N_{NS_T} = N_{CN} + N_{FP}$ = number of nonpreseizure examples; and
$N_{tot} = N_{S_T} + N_{NS_T} = N_{CP} + N_{FN} + N_{CN} + N_{FP}$ = total number of examples.
The experiment is usually a validation test during training, or an
independent test or one of many cross-validation tests during future performance
assessment. The examples in an experiment can be defined on a point basis, where a
feature vector counts as one example, or a block basis, where a whole signal epoch
counts as one example. Results under the two bases are related: point basis and block
basis are like high-resolution and low-resolution views, respectively, of the same
performance metrics. The block basis is appropriate for reporting because it is easier to
comprehend (e.g., "detection delay on a continuous seizure epoch" versus "error risk on
random pieces of data"), but not for training the system. What the system is trained to do
in real time is give a decision at every feature-vector instant, not wait until the end of a
block, using training feature vectors, not time-synchronized training blocks. Also, a block
does not allow randomization within the block and severely limits the number of
examples available for conducting validation tests.

From the confusion matrix, several classifier-based performance metrics can be
computed, preferably on a point basis, as shown in Table 1:

TABLE 1

Quantity                                         Formula
Probability of correct positive = Sensitivity    $P_{CP} = N_{CP}/N_{S_T} = \mathrm{Sens}$
Probability of false negative                    $P_{FN} = N_{FN}/N_{S_T}$
Probability of correct negative = Specificity    $P_{CN} = N_{CN}/N_{NS_T} = \mathrm{Spec}$
Probability of false positive                    $P_{FP} = N_{FP}/N_{NS_T}$
Selectivity                                      $\mathrm{Sel} = N_{CP}/(N_{CP} + N_{FP})$
Probability of correct classification            $P_C = (N_{CP} + N_{CN})/N_{tot}$
Probability of error                             $P_E = (N_{FP} + N_{FN})/N_{tot}$
Balance                                          ... $|P_{CN} - 0.5| + |P_{CP} - 0.5| + 1$

At any instant in feature time, $P_{FP}$ in the above table is the probability that the next
class-$NS_T$ feature vector sample (which for practical purposes is any of approximately all
samples) will be a false positive. Under the frequentist view of probability, this is the
average number of false positives per feature sampling period. The more widely reported
false-positives-per-hour (FPH) can be computed from $P_{FP}$ as the number of point-basis
FPs expected in one hour:

$$\mathrm{FPH} = P_{FP} \times \frac{3600}{T_f},$$

where $T_f$ is the feature sampling period in seconds. Again, this relation holds under the
assumption that $P(NS_T) \approx 1$. The FPH figure is appropriate for block-basis reports, but
can be misleading because of its dependence on $T_f$. For example, a seizure detector with 1
FP per hour is considered acceptable, but it could be a failed system that outputs only 1.1
decisions per hour. The quantity called selectivity in the table is an alternative definition
of specificity that indicates how many of all the detected preseizures were specifically
preseizures. The quantity called balance is 0 for the worst case when $(P_{CN}, P_{CP})$ = (0,1) or
(1,0), and is 1 in the best case when $P_{CN} = P_{CP}$. Prior probability estimates also follow
from the confusion matrix as shown in Table 2:

TABLE 2

Quantity                              Formula
Prior probability of preseizure       $P(S_T) = N_{S_T}/N_{tot}$
Prior probability of nonpreseizure    $P(NS_T) = N_{NS_T}/N_{tot}$


The above performance metrics and estimates are not all independent. At most
four equations are required to determine all entries in the confusion matrix, and thus all
the listed quantities. Therefore, at most four of the above measures can be independent
(e.g., $P_{CP}$, $P_{FP}$, $P_C$, and Sel form one such set). Average detection delay $T_D$ is another
classifier-based performance metric that is appropriate for block-basis reports. However,
it is not independent of some of the point-basis quantities either. For example, it is not
possible for a seizure detector to have long detection delays and at the same time high
probability of correct classification (every feature vector during the delay period counts
as a point-basis false negative).
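
The count-based estimates discussed above can be collected in a few lines. The following sketch uses the conventional count-ratio definitions of the listed quantities and the hourly false-positive conversion under the stated assumption that nonpreseizure samples dominate; the class and function names and the example counts are hypothetical, and the balance metric is omitted.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    n_cp: int  # correct positives (preseizure detections)
    n_fn: int  # false negatives (preseizure misses)
    n_fp: int  # false positives (false alarms)
    n_cn: int  # correct negatives


def point_metrics(c: Confusion, t_f_seconds: float) -> dict:
    """Point-basis metrics from a confusion matrix.

    Assumes both true classes are present and at least one positive
    declaration was made, so no denominator is zero.
    """
    n_s = c.n_cp + c.n_fn    # true preseizure examples
    n_ns = c.n_cn + c.n_fp   # true nonpreseizure examples
    n_tot = n_s + n_ns
    p_fp = c.n_fp / n_ns
    return {
        "P_CP": c.n_cp / n_s,
        "P_FN": c.n_fn / n_s,
        "P_CN": c.n_cn / n_ns,
        "P_FP": p_fp,
        "Sel": c.n_cp / (c.n_cp + c.n_fp),
        "P_C": (c.n_cp + c.n_cn) / n_tot,
        "P_E": (c.n_fp + c.n_fn) / n_tot,
        "P(S_T)": n_s / n_tot,
        "P(NS_T)": n_ns / n_tot,
        # Expected point-basis false positives in one hour, valid when P(NS_T) is near 1
        "FPH": p_fp * 3600.0 / t_f_seconds,
    }


# Made-up counts with one feature vector per second
m = point_metrics(Confusion(n_cp=90, n_fn=10, n_fp=50, n_cn=9850), t_f_seconds=1.0)
```

With these made-up counts, sensitivity and specificity both look high while selectivity is far lower, which illustrates why the quantities are not interchangeable when the preseizure class is rare.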

From the total probability theorem (or from the counts in the above tables), the
probability of error can also be written as

$$P_E = P(E \mid S_T)P(S_T) + P(E \mid NS_T)P(NS_T) = P_{FN}P(S_T) + P_{FP}P(NS_T).$$

Thus, $P_E$ penalizes the two types of detector error (FNs and FPs) with relative weights
that are the prior probabilities of the classes. In the context of the present invention, it is
usually not enough to penalize false negatives with $P(S_T)$, which is very small for the key
time scales in the range of a few hours or less. False negatives, especially for seizure
events, are considered intolerable. The error risk metric assigns risk factors $r > 0$ to the
errors, so that their relative costs can be accounted for:

$$R_E = P_{FN}P(S_T)\,r_{FN} + P_{FP}P(NS_T)\,r_{FP},$$

where $r_{FN}$ is a risk factor associated with missing preseizures, and $r_{FP}$ is a risk factor
associated with declaring false positives. The relative sizes of these risk factors should be
chosen as if assuming that the prior probabilities are equal, $P(S_T) = P(NS_T)$. The
experienced practitioner can also make a judicious choice of the penalty weights $\pi$
directly in the formula:

$$R_E = P_{FN}\,\pi_{FN} + P_{FP}\,\pi_{FP}.$$

Typically, missed seizures are considered so much worse than false alarms that $\pi_{FN} > \pi_{FP}$
(this overrides the effect of low $P(S_T)$, e.g., $\pi_{FN} = 0.75$ and $\pi_{FP} = 0.25$), but exactly how
much can be based on the patient's untreated seizure frequency, forms of therapy, or
other factors. Ultimately, the choice of risk or penalty factors should lead to long-term
QOLI improvement.
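
The practical difference between the probability of error and the error risk can be seen by comparing two detectors at a rare-event prior. The sketch below is illustrative only: the two operating points and the prior of 0.01 are invented for the comparison, the penalties are the example values 0.75 and 0.25, and the function names are hypothetical.

```python
def error_probability(p_fn, p_fp, prior_s):
    """P_E = P_FN*P(S_T) + P_FP*P(NS_T): errors weighted by class priors."""
    return p_fn * prior_s + p_fp * (1.0 - prior_s)


def error_risk(p_fn, p_fp, pi_fn=0.75, pi_fp=0.25):
    """R_E = P_FN*pi_FN + P_FP*pi_FP: errors weighted by chosen penalties."""
    return p_fn * pi_fn + p_fp * pi_fp


# Two invented detectors evaluated at a rare-event prior P(S_T) = 0.01
a = dict(p_fn=0.50, p_fp=0.001)  # misses half of the preseizures
b = dict(p_fn=0.05, p_fp=0.010)  # few misses, more false alarms

p_e_a = error_probability(prior_s=0.01, **a)
p_e_b = error_probability(prior_s=0.01, **b)
r_e_a = error_risk(**a)
r_e_b = error_risk(**b)
```

Under the probability of error, detector `a` looks better because its many misses are discounted by the small prior; under the error risk, detector `b` is preferred, which is the behavior the penalty weights are meant to enforce.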

When the invention is practiced as a monotherapeutic device using only the
seizure detector-controller $g(P_{-1/30}(t), \theta^*_{-1/30})$, the error risk performance metric may be
sufficient. This metric penalizes the allowance of seizures or detection delays (via
point-basis FNs) and the unnecessary activations (FPs). However, when one or more
predictors-controllers are involved, false positives during online operation become
indistinguishable from the successful outcomes of therapy (the correct positives for
which seizures were prevented). That is, there will be times when therapy is activated due
to predictions and no seizure is recorded, but it is unknown whether a seizure was
actually going to occur. This paradoxical condition arises in seizure-preventive systems
(it is not an issue in strictly seizure-responsive systems provided that seizure initiation
can be unequivocally confirmed) but was never addressed heretofore. According to the
present invention, the information on whether online activations were false or true
positives is partially available through the higher-level QOLI metric, which penalizes all
disturbances to the patient: seizures, and both necessary and unnecessary activations. At
times, it may be that a better solution, according to QOLI, is to suppress therapy during
those ambivalent times. The only adverse effect of the FP/CP dilemma is that the size of
the learning sets will be initially reduced, or the labeling of classes will not be accurate.
In the first scheme, all epochs containing ambivalent false/true positives (activations that
did not lead to a confirmed seizure), some of which could have served as examples of
preseizures and the rest as nonpreseizures, are excluded from the learning sets. The
system will still learn to reduce FPs and FNs as usual from epochs before actual recorded
seizures and from baselines. During offline learning, the FP/CP ambivalence does not
exist because there is no therapy in the software simulations. However, if it is observed
that QOLI is worsening, it may be hypothesized that these cases were false positives and
can be added as nonpreseizures to the learning sets, until the index improves and stabilizes
at a steady state. This may involve some oscillations. For example, if the system is
operating in such a way that no actual seizures are recorded by the end of a period, then
no new preseizure data will be available for updating the learning sets. This may imply a
satisfactory but not necessarily the best achievable QOLI. The system may then start
reducing therapy via FP hypotheses, sometimes temporarily worsening QOLI, until an
optimum level is achieved at equilibrium. In an alternative scheme, the ambivalent
epochs are not excluded, and are randomly assigned $S_T$ and $NS_T$ labels for use with the
overall risk performance metric.

The overall risk generalizes the error risk metric to include risk factors for correct
classifications too:

$$R_O = P_{FN}P(S_T)\,r_{FN} + P_{FP}P(NS_T)\,r_{FP} + P_{CP}P(S_T)\,r_{CP} + P_{CN}P(NS_T)\,r_{CN},$$

or in terms of penalties,

$$R_O = P_{FN}\,\pi_{FN} + P_{FP}\,\pi_{FP} + P_{CP}\,\pi_{CP} + P_{CN}\,\pi_{CN}.$$

Similar to QOLI, $R_O$ penalizes all therapy activations and is generally the preferred
classifier-based performance metric for influencing QOLI. There is no reason to penalize
the system for correct negatives, which prolonged would lead to the ideal 100% QOLI, so
$\pi_{CN}$ is always zero. Penalties would typically be $\pi_{FP}$ = 'medium', $\pi_{FN}$ = 'very large',
$\pi_{CP}$ = 'small', and $\pi_{CN}$ = 0. After presurgical evaluation, all online FPs and most CPs (the
ones that prevent seizures) of a seizure predictor-controller will be ambivalent. As previously
explained, these epochs can be initially ignored because it is not known whether to label
them as class $S_T$ or $NS_T$ for learning. However, $R_O$ can be measured without difficulty
from the remaining epochs. In the alternative scheme, random hypotheses are made
regarding the classes of epochs, and penalties are equally assigned to FPs and CPs. Thus,
penalties would be $\pi_{FP}$ = 'between small and medium', $\pi_{FN}$ = 'very large',
$\pi_{CP}$ = 'between small and medium', and $\pi_{CN}$ = 0.

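Table 3 summarizes the last set of classifier-based performance metrics.

TABLE 3

Quantity                    Formula
False-positives-per-hour    $\mathrm{FPH} = P_{FP} \times 3600/T_f$
Average detection delay     $T_D$
Error risk                  $R_E = P_{FN}\pi_{FN} + P_{FP}\pi_{FP}$
Overall risk                $R_O = P_{FN}\pi_{FN} + P_{FP}\pi_{FP} + P_{CP}\pi_{CP}\ (+\ P_{CN}\pi_{CN})$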
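
The overall risk and the score derived from it can be sketched as follows. This is an illustration under stated assumptions rather than the disclosed implementation: the qualitative penalties 'very large', 'medium', and 'small' are replaced by the arbitrary numbers 0.70, 0.20, and 0.10, the risk is treated as a fraction that is reported as a percentage, and the function names are hypothetical.

```python
def overall_risk(p_fn, p_fp, p_cp, p_cn, pi_fn, pi_fp, pi_cp, pi_cn=0.0):
    """R_O = P_FN*pi_FN + P_FP*pi_FP + P_CP*pi_CP + P_CN*pi_CN.

    pi_cn defaults to zero because correct negatives are never penalized.
    """
    return p_fn * pi_fn + p_fp * pi_fp + p_cp * pi_cp + p_cn * pi_cn


def score_percent(r_o):
    """S = 100 - R_O%, assuming r_o is a fraction in [0, 1]."""
    return 100.0 - 100.0 * r_o


# Assumed numeric stand-ins for 'very large', 'medium', and 'small'
r_o = overall_risk(p_fn=0.10, p_fp=0.005, p_cp=0.90, p_cn=0.995,
                   pi_fn=0.70, pi_fp=0.20, pi_cp=0.10)
s = score_percent(r_o)
```

Unlike the error risk, the overall risk charges a small cost even for correct positives, so a classifier that activates therapy often is penalized whether or not the activations were justified.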



Optimal decision rules. For the same feature vector $x$, different classifiers can
achieve different values of the performance metrics (e.g., probability of false positive).
An optimal decision rule singles out the best classifier $C^*(x)$ in the space of decision
boundaries $\Omega$, with one or more performance metrics satisfying a desired set of criteria
(e.g., minimum probability of false positive subject to a fixed probability of false
negative). Actually, infinitely many classifiers can yield the same optimal decision rule as
long as they imply the same class decision boundaries, but only some, preferred
according to the present invention, are amenable to corrections described hereinafter.
Recall from the eight-step procedure of Fig. 4 that classifiers $C(x)$ are used in the present
invention during offline learning to facilitate the search for decision rules which are later
implemented online by means of probability estimates and thresholds as in
Fig. 2.

Six optimality criteria are described below: maximum likelihood, minimum error
(maximum a posteriori), Neyman-Pearson criterion, minimum error risk, minimax error
risk, and minimum overall risk. The achievement of 100% detection rate with minimal
false positives (a Neyman-Pearson type of criterion) is commonly cited in the prior art as
the ideal for a seizure detection system. This is an excellent criterion for researching new
feature libraries. However, with automatic treatment involved, the ideal must also
consider control effort/side effects. In addition, as will be seen later, the 100% CP
constraint shifts all the "burden" away from the classifier and places currently unrealistic
demands (for prediction) on the features. Although all the presented optimality criteria
are available for selection by the authorized caregiver during learning periods, minimum
overall risk is currently the preferred compromise for practicing the invention with an
existing feature library if one or more seizure predictors-controllers are involved.

Maximum likelihood. The likelihood that a feature vector belongs to class $C_i$ is
simply the "height" of the class-conditional probability density function (PDF) $p(x|C_i)$
when evaluated at x. The maximum-likelihood estimate of the class is

$$C(x) = \arg\max_{C_i \in \{S_T,\, NS_T\}} \{\, p(x|S_T),\; p(x|NS_T) \,\}.$$

The above equation simply says, "output the argument $C_i$ for which $p(x|C_i)$ is maximum,
as $C_i$ takes on values from the set of classes $\{S_T, NS_T\}$." In terms of a threshold on the
likelihood ratio, the decision rule declares class $S_T$ if

$$\frac{p(x|S_T)}{p(x|NS_T)} > 1.$$

Recall that PDFs are greater than or equal to zero and integrate to 1 in $R^n$, but each PDF
can have a different maximum height, and each height can be between zero and infinity.
Also note that the likelihood ratio can vary from 0 to $\infty$.

As will become apparent from the following discussion, for a given feature vector
x (as a set of feature formulas/algorithms, not numbers), the maximum-likelihood rule
solves the problem

$$\min_{\Omega} \; 0.5\,P_{FN} + 0.5\,P_{FP},$$

giving equal weights to the errors independently of the prior probabilities of the classes.
In terms of a threshold on the posterior probability, the decision rule declares class $S_T$ if

$$P(S_T|x) > P(S_T).$$

Note that this threshold is very small if $S_T$ is a rare event.

Minimum error (maximum a posteriori). For a given feature vector, the problem
to solve is

$$\min_{\Omega} P_E.$$

With a single binary {0,1} classifier output and {0,1} target codes for the classes
$\{NS_T, S_T\}$, the problems of minimum mean absolute error and minimum mean squared
error yield the same result as minimum $P_E$. The minimum-error classifier is the maximum
a posteriori estimator of the classes. That is, a classifier makes, on average, the least
number of misclassification errors if it always assigns x to the class it most probably
belongs to:

$$C(x) = \arg\max_{C_i \in \{S_T,\, NS_T\}} \{\, P(S_T|x),\; P(NS_T|x) \,\},$$

where $P(S_T|x)$ is the posterior probability of class $S_T$ given the observation x, and
$P(NS_T|x) = 1 - P(S_T|x)$ is its complement. In terms of a threshold on the posterior
probability, the decision rule declares class $S_T$ if

$$P(S_T|x) > 0.5.$$

Note that the 0.5 threshold does not imply that the classifier can "easily" declare $S_T$
simply because half of the interval [0,1] corresponds to that class. If $P(S_T)$ is very small,
it can be extremely difficult to see $P(S_T|x)$ reaching 0.5 (the few times when x becomes
incompatible with the $NS_T$ class). On the other hand, for large T, the prior probability is
close to 1, and the probability function will most often exceed 0.5. The prior probability
information, which is built into the $P(S_T|x)$ function, is what predisposes the minimum-
error classifier to declare "easily" or "hardly."

To obtain the function $P(S_T|x)$ from data by brute force, one would slice the input
space into a number of hypercells and count the frequencies of $S_T$ when x was observed
in each cell. However, it is more practical to start off from Bayes' theorem:

$$P(S_T|x) = \frac{p(x|S_T)\,P(S_T)}{p(x|S_T)\,P(S_T) + p(x|NS_T)\,P(NS_T)}.$$

This formula does not by itself ameliorate the well-known "curse of dimensionality," but
it leads to a convenient simplification when applied to minimum-error classifiers, deals
with the more informative class-conditional PDFs, and allows the derivation of
corrections for data that do not reflect the true prior probabilities. Applying the formula
on the complement class, $P(NS_T|x) = p(x|NS_T)P(NS_T)/p(x)$, note that the denominator is
the same, so only the numerator is needed to decide the class:

$$C(x) = \arg\max_{C_i \in \{S_T,\, NS_T\}} \{\, p(x|S_T)\,P(S_T),\; p(x|NS_T)\,P(NS_T) \,\}.$$

The class-conditional PDFs scaled by their prior probabilities, $p(x|S_T)P(S_T)$ and
$p(x|NS_T)P(NS_T)$, will be referred to as "Bayes numerators." Class $S_T$ is declared
whenever the Bayes numerator for $S_T$ exceeds that of $NS_T$. In terms of a threshold on the
likelihood ratio, the decision rule declares class $S_T$ if

$$\frac{p(x|S_T)}{p(x|NS_T)} > \frac{P(NS_T)}{P(S_T)}.$$

Note that the likelihood ratio on the left-hand side can be estimated independently of
prior probabilities (each class-conditional PDF comes from data within the $S_T$ or $NS_T$
universes, without any regard to their proportions). The right-hand side is an easily
estimated constant threshold that is between 0 and $\infty$. For small $P(S_T)$, the threshold on
the likelihood ratio will be relatively large (e.g., 199 for $P(S_T)$ = 0.005). This is consistent
with the fact that the small prior probability gives small weight to misclassifying $S_T$,
making the minimum-error classifier "reluctant" to declare $S_T$. Also note that in the
special case of equal prior probabilities for all the classes, the minimum-error and
maximum-likelihood rules become equivalent.
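
By way of illustration only, the maximum-likelihood and minimum-error rules described above may be sketched as follows. The function names are hypothetical and not part of the described system, and the class-conditional density values are assumed to be supplied by some upstream estimator.

```python
def bayes_numerators(px_s, px_ns, prior_s):
    """Return p(x|S_T)P(S_T) and p(x|NS_T)P(NS_T) for one observation x."""
    return px_s * prior_s, px_ns * (1.0 - prior_s)


def posterior_s(px_s, px_ns, prior_s):
    """Posterior P(S_T|x) from Bayes' theorem (two-class case)."""
    num_s, num_ns = bayes_numerators(px_s, px_ns, prior_s)
    return num_s / (num_s + num_ns)


def declare_max_likelihood(px_s, px_ns):
    """Declare S_T when the likelihood ratio exceeds 1 (priors ignored)."""
    return px_s > px_ns


def declare_min_error(px_s, px_ns, prior_s):
    """Declare S_T when its Bayes numerator exceeds that of NS_T."""
    num_s, num_ns = bayes_numerators(px_s, px_ns, prior_s)
    return num_s > num_ns


def min_error_lr_threshold(prior_s):
    """Likelihood-ratio threshold P(NS_T)/P(S_T) of the minimum-error rule."""
    return (1.0 - prior_s) / prior_s


# Hypothetical observation whose likelihood ratio is 50, with a rare S_T class.
px_s, px_ns, prior = 5.0, 0.1, 0.005
print(declare_max_likelihood(px_s, px_ns))    # likelihood ratio 50 > 1
print(declare_min_error(px_s, px_ns, prior))  # 50 is below the 199 threshold
print(posterior_s(px_s, px_ns, prior))        # posterior stays under 0.5
```

In this hypothetical case the two rules disagree: the likelihood ratio exceeds 1 but not the prior-dependent threshold, so only the maximum-likelihood rule declares $S_T$.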

Neyman-Pearson criterion. The Neyman-Pearson criterion is a decision rule for
detectors that usually refers to choosing a constant false-alarm rate while minimizing the
false-negative rate. In the present invention, the optimality criterion of interest is
(swapping the classes) minimum probability of false positive subject to the constraint of a
fixed probability of false negative. For a given feature vector, the problem to solve is

$$\min_{\Omega} P_{FP} \quad \text{subject to} \quad P_{FN} = \text{constant}.$$

It is not possible to give an explicit formula for the decision rule because the decision
boundaries vary with the particular pair of class-conditional PDFs, but the answer will
still be a threshold on the likelihood ratio (otherwise, the system would give inconsistent
rules such as declare $S_T$ if $P(S_T|x)$ > 0.8 or < ...). Geometrically, the fixed FN "area"
(really volume in n dimensions) below $p(x|S_T)$, which covers the region $\Omega_{NS_T} \subset R^n$ where
x is declared as nonpreseizure, should be distributed (disjointly if necessary) so that the
FP area below $p(x|NS_T)$, which covers the "complement" region $\Omega_{S_T}$ where x is declared
as preseizure, is as small as possible. This last area is usually in the "tails" of $p(x|NS_T)$.
However, in systems such as seizure detectors where the FN area ($P_{FN}$) is forced to be
almost zero, $\Omega_{S_T}$ basically covers the whole feature space $R^n$. Therefore, the only
sensible way of reducing FPs under the constraint of 0 FNs is to look for different, more
separable features. The problem to be solved becomes

$$\min_{x} P_{FP} \quad \text{subject to} \quad P_{FN} = 0.$$

With features that do not overlap too much, $\Omega_{S_T}$ can cover just enough space to satisfy
the high sensitivity requirement, while at the same time leaving room for $\Omega_{NS_T}$ to produce
a reasonable specificity.
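
The dependence of the false-positive rate on feature separability under a fixed false-negative constraint can be illustrated with a minimal sketch. It assumes, purely for illustration, a single feature with Gaussian class-conditional PDFs of equal variance, in which case the likelihood ratio is monotonic in x and the rule reduces to one threshold on x. The function name and the numerical values are hypothetical.

```python
from statistics import NormalDist


def neyman_pearson_1d(mu_ns, mu_s, sigma, p_fn):
    """Threshold on x giving the requested P_FN, and the resulting P_FP.

    Assumes mu_s > mu_ns and equal variances, so S_T is declared for x > t.
    """
    s_pdf = NormalDist(mu_s, sigma)
    ns_pdf = NormalDist(mu_ns, sigma)
    t = s_pdf.inv_cdf(p_fn)        # fixed FN area of p(x|S_T) lies below t
    p_fp = 1.0 - ns_pdf.cdf(t)     # FP area of p(x|NS_T) lies above t
    return t, p_fp


# Same near-zero P_FN constraint, two hypothetical degrees of separability.
for mu_s in (2.0, 6.0):
    t, p_fp = neyman_pearson_1d(mu_ns=0.0, mu_s=mu_s, sigma=1.0, p_fn=0.001)
    print(f"mu_s={mu_s}: threshold={t:.3f}, P_FP={p_fp:.4f}")
```

With heavily overlapping classes the declared-preseizure region covers most of the feature axis and the false-positive rate is large; only the more separable feature brings it down under the same constraint.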

Minimum error risk. For a given feature vector, the problem to solve is

$$\min_{\Omega} R_E.$$

Compared to the minimum-error rule, the relative weights for the conditional PDFs
change from $P(S_T)$ and $P(NS_T)$ to $P(S_T)r_{FN}$ and $P(NS_T)r_{FP}$. The optimal decision rule
becomes

$$C(x) = \arg\max_{C_i \in \{S_T,\, NS_T\}} \{\, p(x|S_T)\,\pi_{FN},\; p(x|NS_T)\,\pi_{FP} \,\}.$$

In terms of a threshold on the likelihood ratio, the decision rule declares class $S_T$ if

$$\frac{p(x|S_T)}{p(x|NS_T)} > \frac{P(NS_T)\,r_{FP}}{P(S_T)\,r_{FN}} = \frac{\pi_{FP}}{\pi_{FN}}.$$

In terms of a threshold on the posterior probability, the decision rule declares class $S_T$ if

$$P(S_T|x) > \frac{1}{1 + \dfrac{r_{FN}}{r_{FP}}} = \frac{1}{1 + \dfrac{\pi_{FN}\,P(NS_T)}{\pi_{FP}\,P(S_T)}}.$$

As in the maximum-likelihood rule, this threshold can be very small if $P(S_T)$ is very
small. For example, with $P(S_T)$ = 0.005, and even more emphasis on avoiding FNs than
maximum likelihood: $\pi_{FN}$ = 0.75 and $\pi_{FP}$ = 0.25, the threshold is 0.0017.

Increasing the risk of FPs raises the probability threshold, reducing the system's
propensity to declare $S_T$. Similarly, increasing the risk of FNs lowers the threshold,
making the system more sensitive. If instead of the above thresholds on the likelihood
ratio or the probability, some other constant is used, the classifier's operating point
moves along the "receiver operating characteristic" (ROC; $P_{CP}$ vs. $P_{FP}$) curve, trading off
the ability to detect preseizures for the ability to detect nonpreseizures (and losing
optimality as initially defined by the performance metrics, of course). Changing these
simple thresholds on the likelihood ratio or the probability implies that the classifier's
complex decision boundaries also change in the n-dimensional space of features. Except
for academic special cases, separately thresholding one-dimensional features, as
commonly found in prior art systems, creates suboptimal decision rules. For two or more
features, the resulting suboptimal decision regions are piecewise rectangular, with
boundaries parallel to the feature axes.
As a degenerate special case, a feature vector may have posteriors $P(S_T|x)$ that are
always equal to the prior $P(S_T)$. This implies that the feature vector behaves identically
under preseizure and nonpreseizure conditions (its class-conditional PDFs are the same,
$p(x|S_T) = p(x|NS_T)$). The optimal classifier would be the constant-output classifier given
by

$$C = \arg\max_{C_i \in \{S_T,\, NS_T\}} \{\, P(S_T)\,r_{FN},\; P(NS_T)\,r_{FP} \,\}.$$

This will be the constant '$NS_T$' for short-term prediction horizons (unless the risk
factors offset this), and '$S_T$' for longer T. In terms of the online implementation with
probability and threshold as in Fig. 2, the input to the activation decision logic 60 is the
constant $P(S_T)$, and so the output will also be a constant: either "intervene forever" or "do
not intervene ever." For prediction horizons in the range of greatest interest (from -2
seconds to a few hours), $P(S_T)$ is small, and the decision could be to remain off all the
time. For longer horizons T, when the patient's natural interseizure period becomes
comparable to T, $P(S_T)$ quickly approaches 100% and the decision will be to intervene in
purely open loop, just like traditional medication regimes. According to the present
invention, this worst-case scenario can only occur in the rare instance of a patient for
whom no feature can be found to separate seizure states. In extensive research, no patient
has presented this difficulty to date.

Minimax error risk. For a given feature vector, the problem to solve is

$$\min_{\Omega} \max_{P(S_T)} R_E,$$

considering $P(S_T)$ as an unknown (anywhere between 0 and 1). The error risk metric can
also be written as

$$R_E = (P_{FN}\,r_{FN} - P_{FP}\,r_{FP})\,P(S_T) + P_{FP}\,r_{FP}.$$

For a given set of decision boundaries, this equation has the form of a straight line
segment in the variable $P(S_T)$, with starting point at the coordinates $(P(S_T), R_E)$ =
$(0, P_{FP}r_{FP})$ and ending point at $(1, P_{FN}r_{FN})$. If decision boundaries are chosen such that
$P_{FN}r_{FN} > P_{FP}r_{FP}$, the slope of the line segment is positive and the worst (maximum) $R_E$
occurs at the ending point. If decision boundaries are chosen such that $P_{FN}r_{FN} < P_{FP}r_{FP}$,
the slope of the line segment is negative and the worst $R_E$ occurs at the starting point. The
more inclined the slope (positive or negative), the worse the maximum $R_E$ becomes. The
minimum of this worst-case $R_E$ occurs when decision boundaries are chosen such that

$$P_{FN}\,r_{FN} = P_{FP}\,r_{FP}.$$

In this case, the slope of the line is zero and $R_E$ becomes independent of $P(S_T)$. Note that
without risk factors, this criterion is equivalent to maximizing the balance performance
metric. It also minimizes the maximum error rate $P_E$, or equivalently, minimizes the
maximum of the two types of detector errors: FNs and FPs.
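
The straight-line argument above can be checked numerically with a minimal sketch. It assumes, for illustration only, a single Gaussian feature with equal class variances, a decision rule of the form "declare $S_T$ if x > t", and a simple grid search over t. All names and values are hypothetical.

```python
from statistics import NormalDist


def error_risk(p_fn, p_fp, r_fn, r_fp, prior_s):
    """R_E as a straight line in the prior P(S_T)."""
    return p_fn * r_fn * prior_s + p_fp * r_fp * (1.0 - prior_s)


def worst_case_risk(p_fn, p_fp, r_fn, r_fp):
    """Maximum of R_E over P(S_T) in [0, 1]: one of the two line endpoints."""
    return max(p_fp * r_fp, p_fn * r_fn)


def minimax_threshold(mu_ns, mu_s, sigma, r_fn=1.0, r_fp=1.0, steps=12001):
    """Grid search for the threshold t minimizing the worst-case error risk."""
    ns_pdf, s_pdf = NormalDist(mu_ns, sigma), NormalDist(mu_s, sigma)
    lo, hi = mu_ns - 5.0 * sigma, mu_s + 5.0 * sigma
    best = None
    for i in range(steps):
        t = lo + (hi - lo) * i / (steps - 1)
        p_fn = s_pdf.cdf(t)           # preseizure mass declared NS_T
        p_fp = 1.0 - ns_pdf.cdf(t)    # nonpreseizure mass declared S_T
        w = worst_case_risk(p_fn, p_fp, r_fn, r_fp)
        if best is None or w < best[0]:
            best = (w, t, p_fn, p_fp)
    return best


w, t, p_fn, p_fp = minimax_threshold(mu_ns=0.0, mu_s=2.0, sigma=1.0)
print(t, p_fn, p_fp)
# At the minimax point the risk no longer depends on the assumed prior.
print(error_risk(p_fn, p_fp, 1.0, 1.0, 0.005), error_risk(p_fn, p_fp, 1.0, 1.0, 0.9))
```

With unit risk factors the search settles where the two conditional error rates are equal, which is the zero-slope condition stated above.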

Minimum overall risk. For a given feature vector, the problem to solve is

$$\min_{\Omega} R_O.$$

The optimal decision rule is

$$C(x) = \arg\max_{C_i \in \{S_T,\, NS_T\}} \{\, p(x|S_T)\,[\pi_{FN} - \pi_{CP}],\; p(x|NS_T)\,[\pi_{FP} - \pi_{CN}] \,\}.$$

In terms of a threshold on the likelihood ratio, the decision rule declares class $S_T$ if

$$\frac{p(x|S_T)}{p(x|NS_T)} > \frac{P(NS_T)\,[r_{FP} - r_{CN}]}{P(S_T)\,[r_{FN} - r_{CP}]} = \frac{[\pi_{FP} - \pi_{CN}]}{[\pi_{FN} - \pi_{CP}]}.$$

In terms of a threshold on the posterior probability, the decision rule declares class $S_T$ if

$$P(S_T|x) > \frac{1}{1 + \dfrac{r_{FN} - r_{CP}}{r_{FP} - r_{CN}}}.$$

Once again, this threshold can be very small if $P(S_T)$ is very small. For example, for
$P(S_T)$ = 0.005, and $\pi_{FP}$ = 0.5, $\pi_{FN}$ = 0.95, $\pi_{CP}$ = 0.15, and $\pi_{CN}$ = 0, the threshold would be
0.625 on the likelihood ratio or 0.0031 on the posterior probability. For the random
hypotheses alternative scheme with $\pi_{FP}$ = 0.25, $\pi_{FN}$ = 0.95, $\pi_{CP}$ = 0.25, and $\pi_{CN}$ = 0, the
threshold would be 0.3571 on the likelihood ratio or 0.0018 on the posterior probability.
Recall that these low thresholds do not imply that therapy is "too easily" activated. For
these cases of very small $P(S_T)$, the time plot (in the personal computer where the
software interface runs) of $P(S_T|x)$ remains close to zero most of the time until abrupt
changes occur at times of declaration. For better visualization of small changes close to
zero and the transition periods, the logarithm of $P(S_T|x)$ can be plotted instead. When the
device itself is built around a small microprocessor (e.g., an 8-bit system), the probability
estimators for short T's should directly output $\mu$-law or other companded values, so that
the majority of digital codes are assigned to the small (and more frequent) values of
$P(S_T|x)$.
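
The numerical thresholds quoted above can be reproduced with a short sketch. It assumes that the likelihood-ratio threshold is the ratio of penalty-factor differences and that it is converted to a posterior threshold through the prior odds. The function names are hypothetical.

```python
def lr_threshold(pi_fp, pi_fn, pi_cp=0.0, pi_cn=0.0):
    """Likelihood-ratio threshold [pi_FP - pi_CN] / [pi_FN - pi_CP]."""
    return (pi_fp - pi_cn) / (pi_fn - pi_cp)


def posterior_threshold(prior_s, pi_fp, pi_fn, pi_cp=0.0, pi_cn=0.0):
    """Equivalent threshold on P(S_T|x), obtained through the prior odds."""
    odds = lr_threshold(pi_fp, pi_fn, pi_cp, pi_cn) * prior_s / (1.0 - prior_s)
    return odds / (1.0 + odds)


prior = 0.005
# Minimum overall risk, first set of penalty factors quoted in the text.
print(round(lr_threshold(0.5, 0.95, 0.15, 0.0), 4),
      round(posterior_threshold(prior, 0.5, 0.95, 0.15, 0.0), 4))
# Random hypotheses alternative scheme.
print(round(lr_threshold(0.25, 0.95, 0.25, 0.0), 4),
      round(posterior_threshold(prior, 0.25, 0.95, 0.25, 0.0), 4))
# Minimum error risk is the special case pi_CP = pi_CN = 0.
print(round(posterior_threshold(prior, 0.25, 0.75), 4))
```

Setting the correct-decision penalty factors to zero recovers the minimum-error-risk threshold, so one routine can serve both criteria.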

Some of the above concepts may seem counterintuitive (e.g., decision rules with
low thresholds, or purposeful misclassifications) because everyday experience makes
individual probabilistic thinking "linear", i.e., individuals feel comfortable with class $C_i$
given x if they have seen x predict $C_i$ more often than not (maximum a posteriori rule),
with a subconscious idea of the classes being essentially equally likely. The following is
a simplistic pedagogical aid to understand some of the issues by analogy. Suppose there
is a bag full of green balls and a few orange balls. Only one out of every 200 balls is
orange. Thus, $P(G)$ = 0.995 and $P(O)$ = 0.005. Some balls have a dark spot feature that is
highly indicative of "orangeness": when a dark spot is seen, 9 out of 10 times the ball is
orange. Thus, $P(O|x{=}1)$ = 0.9. Despite this high selectivity, the sensitivity of the "spot
test" is lower: only 60% of all orange balls have the dark spot. Thus, $P(x{=}1|O)$ = 0.6.
Several other probabilities can be derived, for example, the prior probability of a spot
$P(x{=}1) = P(x{=}1|O)P(O)/P(O|x{=}1)$ = 0.0033, the prior of no spot $P(x{=}0) = 1 - P(x{=}1)$ =
0.9967, a spotless orange ball $P(x{=}0|O) = 1 - P(x{=}1|O)$ = 0.4, a green ball with a spot
$P(x{=}1|G) = [P(x{=}1) - P(x{=}1|O)P(O)]/P(G) = 3.35(10)^{-4}$, etc. Suppose a curtain is draped
in front of the bag so that balls can be drawn without seeing their color. The minimum-
error rule based on a priori knowledge is to always declare G. If however, similar to the
prisoner's dilemma, the penalty for missing a green ball is moderate but death for missing
an orange ball, then obviously the decision rule changes to always declare O. Now
suppose a black-and-white camera and monitor are set up so that the color cannot be
seen, but the dark spot on a ball can be seen if it exists. For every spotless ball drawn,
compute (or obtain from a look-up table) the probability that it is orange as $P(O|x{=}0)$ =
$P(x{=}0|O)P(O)/P(x{=}0)$ = 0.002. If the dark spot is seen, this changes to $P(O|x{=}1)$ = 0.9.
Therefore, the time series of $P(O|x)$ has the form

{.002, .002, .002, .9, .002, .002, ... }

and the minimum-error classifier will only occasionally declare O (whenever it sees the
spot). This will be accomplished with an error rate $P_E = P_{FN}P(O) + P_{FP}P(G)$ =
$P(x{=}0|O)P(O) + P(x{=}1|G)P(G)$ = 0.23%. Because the feature x is binary in this example,
the minimum-risk decisions are the same as minimum-error unless the threshold on
probability is placed below 0.002, which would produce a constant-output O. This
increases sensitivity from 60% to 100%, but the error rate jumps to 0(0.005) + 1(0.995) =
99.5%. With continuous features, the time series of the probability function is not just
two-valued, for example,

{.0015, .0018, .0017, .2, .9, .0014, ... }

so a low threshold that minimizes a risk is not necessarily the extreme risk case/constant-
output classifier. Note that an optimal choice of threshold cannot do more than yield the
best-achievable performance metric for a given feature vector. If no value of threshold
can produce a satisfactory performance then, as previously explained, the burden is on
researching for improved features (e.g., perhaps the distribution of dark spots on a ball,
dark streaks, etc.).
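
The arithmetic of the ball analogy can be reproduced directly from the three stated quantities. The following is a minimal sketch in which the variable names are illustrative only.

```python
# Quantities stated in the analogy.
p_o = 0.005                  # prior of an orange ball, P(O)
p_g = 1.0 - p_o              # prior of a green ball, P(G)
p_o_given_spot = 0.9         # selectivity of the spot, P(O|x=1)
p_spot_given_o = 0.6         # sensitivity of the spot, P(x=1|O)

# Derived quantities, following the same chain of Bayes' theorem steps.
p_spot = p_spot_given_o * p_o / p_o_given_spot            # P(x=1)
p_nospot = 1.0 - p_spot                                   # P(x=0)
p_nospot_given_o = 1.0 - p_spot_given_o                   # P(x=0|O)
p_spot_given_g = (p_spot - p_spot_given_o * p_o) / p_g    # P(x=1|G)
p_o_given_nospot = p_nospot_given_o * p_o / p_nospot      # P(O|x=0)

# Minimum-error rule: declare O only when the spot is seen.
error_rate = p_nospot_given_o * p_o + p_spot_given_g * p_g

# Constant-output O (threshold placed below P(O|x=0)): every green ball is an error.
error_rate_always_o = 0.0 * p_o + 1.0 * p_g

print(f"P(x=1)   = {p_spot:.4f}")
print(f"P(x=1|G) = {p_spot_given_g:.3e}")
print(f"P(O|x=0) = {p_o_given_nospot:.4f}")
print(f"error rate, spot rule = {100 * error_rate:.2f}%")
print(f"error rate, always O  = {100 * error_rate_always_o:.1f}%")
```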

To complete the analogy with a hypothetical application, consider the green balls
as nonpreseizure observation windows (defined by the most recent data point in the
window being class $NS_T$), and the orange balls as preseizure observation windows. The
dark spot feature is a prodrome 90% of the times seen preceding a seizure. Only 60% of
the patient's seizures are stereotyped, so these prodromes are not seen in the remaining
40%. Ten percent of prodromes do occur during nonpreseizure times. The colors, curtain,
and black-and-white camera/monitor symbolize the fact that all the underlying
phenomena giving rise to seizures in a brain cannot be fully observed, but better results
can be achieved than provided by open-loop seizure control by measuring features x
from electrophysiological signals or other correlates of seizure, and closing the control
loop based on them.

Nonparametric classifiers. Parametric classifiers rely on assumptions regarding
feature distributions, from which a few parameters are estimated such as mean vectors
and covariance matrices in a Gaussian model. Classifiers based on universal
approximators such as neural networks and fuzzy systems have a theoretically infinite set
of parameters and learn from examples without relying on assumptions about feature
distributions, but typically consume much of the computational resources during training.
Nonparametric classifiers have basically no parameters and no learning rules. Training is
extremely fast because the entire training data set is simply memorized (stored). The
savings on training more than compensate for the CPU time and memory expense
incurred during application of nonparametric classifiers, so they are used in the present
invention to expedite feature optimization, which may involve hundreds or thousands of
tests.

The k-nearest neighbor (kNN) classifier is a nonparametric, nonlinear classifier
widely accepted for benchmarking. Given an input pattern vector, the kNN searches for
the k most similar (closest in Euclidean distance) vectors in the training database, and
declares the corresponding target class by majority vote. A discriminant function score
$D_i(x)$ can be derived for each class $C_i$ by counting the "popularity" of each class among
the classes of the k nearest neighbors:

$$D_i(x) = \sum_{m=1}^{k} v_{im},$$

where $v_{im}$ is the vote (0 or 1) made for $C_i$ by the m-th nearest neighbor. These popularity
scores can only range from 0 to k, and their sum must equal k. For example, for k = 3 and
two classes, $S_T$ may receive 0 votes while $NS_T$ receives 3 votes (or 1 and 2, or 2 and 1, or
3 and 0). The shape of these kNN discriminants directly (though crudely) approximates
the shape of posterior probabilities $P(C_i|x)$.
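
The popularity scores can be computed with a few lines of code. The following minimal sketch assumes a small in-memory training set and a brute-force neighbor search; the function name, labels, and data are hypothetical.

```python
import math
from collections import Counter


def knn_discriminants(x, train_x, train_y, k, classes):
    """Vote counts D_i(x) among the k training vectors nearest to x."""
    # Brute-force search: rank all stored patterns by Euclidean distance to x.
    order = sorted(range(len(train_x)), key=lambda j: math.dist(x, train_x[j]))
    votes = Counter(train_y[j] for j in order[:k])
    return {c: votes.get(c, 0) for c in classes}


# Hypothetical two-feature training set.
train_x = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (0.9, 1.2)]
train_y = ["NS", "NS", "S", "S", "S"]

scores = knn_discriminants((0.3, 0.2), train_x, train_y, k=3, classes=("S", "NS"))
print(scores)                          # votes per class; they sum to k
print(max(scores, key=scores.get))     # majority-vote class
```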

The method of Parzen windows applied to classification, or kernel discriminant
analysis, is a classical technique in statistics reinvented as the probabilistic neural
network (PNN). With proper corrections, the PNN or its variants can approach optimal
decision rules as the number of training observations $N_{obs}$ goes to infinity. Typically, a
radially symmetric Gaussian node (Parzen window) is centered at each training input
vector pattern $p_j$. The width of all these windows is controlled by a common bandwidth
or smoothing parameter. Here, we use an inverse bandwidth parameter $b = (\sigma\sqrt{2})^{-1}$,
where $\sigma$ is the standard deviation of the Gaussian kernel. For $N_c$ classes, the output
targets are encoded in 1-of-$N_c$ binary vector format $[t_{1j}\; t_{2j}\; \ldots\; t_{N_c j}]$, where only one of
these $t_{ij}$'s is equal to 1 at the position i that indicates class $C_i$, and all others are zero.
Given an input vector x, the discriminant function for each class $C_i$ is computed as

$$D_i(x) = \sum_{j=1}^{N_{obs}} t_{ij}\, \exp\!\left(-b^2 \lVert x - p_j \rVert^2\right).$$

This is the sum of all the Gaussian "bells" that were obtained from class $C_i$, evaluated at
x. The shapes of these discriminants are like class-conditional histograms: they
approximate the shape of Bayes numerators (class-conditional PDFs scaled by prior
probabilities) $p(x|C_i)P(C_i)$.

In an alternative PNN, the discriminant function for each class is

$$D_i(x) = \frac{1}{N_i} \sum_{j=1}^{N_{obs}} t_{ij}\, \exp\!\left(-b^2 \lVert x - p_j \rVert^2\right),$$

where $N_i$ is the number of training vectors belonging to class $C_i$, that is,

$$N_i = \sum_{j=1}^{N_{obs}} t_{ij}.$$

Note that division by $N_i$ produces discriminant function surfaces which are the average of
all kernels for each class. These surfaces crudely approximate the shape of class-
conditional densities. The maximum value that the estimated curves can attain is 1 (which
happens if all examples of a class are identical). When the PDFs have different spreads,
the heights will be different. It is seen that the averaging makes this PNN blind to prior
probabilities. To illustrate, if there are one million examples of one class and only one
example of the other, then dividing by $N_1 = 10^6$ and $N_2 = 1$ bounds each estimated curve
by a maximum value of 1, irrespective of the constituents' proportions.

As the inverse bandwidth parameter b of PNNs becomes large, the Parzen windows
become very "thin," so that only the training bell closest to the input vector is
significantly activated. The resulting decision rule is therefore identical to the nearest
neighbor rule (kNN with k = 1). In practice, PNNs should not be used as a kNN,
however, because the extremely small values of the Gaussian for far input patterns cause
occasional numerical problems such as underflow and division-by-zero.

Before going through a competitive layer (maximum selector) to declare a class,
the discriminant functions of these classifiers can be convexly normalized so that their
sum across classes equals 1:

$$\bar{D}_i(x) = \frac{D_i(x)}{\sum_{j=1}^{N_c} D_j(x)}.$$

This convex normalization does not change classifier decisions in any way (the class with
maximum $D_i$ will also have maximum $\bar{D}_i$), but at least for the first PNN, the values can
serve as estimates of posterior probabilities of the classes: $\bar{D}_i(x) \approx P(C_i|x)$.
For classifiers to which this transformation of nonnegative discriminants is applied, the $\bar{D}_i$
approach either 1 or 0 as $\lVert x \rVert \to \infty$, so these normalized discriminants resemble
continuous posterior probability functions in the domain extremes, where at least one of
the classes can be classified with absolute certainty.
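
The two PNN variants and the convex normalization can be sketched together. The following is a minimal illustration that assumes an unnormalized Gaussian kernel of unit peak height, exp(-b^2 ||x - p||^2); the function names and data are hypothetical.

```python
import math


def pnn_discriminants(x, patterns, labels, classes, b, average=False):
    """Sum (or, if average=True, mean) of Gaussian kernels per class at x."""
    total = {c: 0.0 for c in classes}
    count = {c: 0 for c in classes}
    for p, c in zip(patterns, labels):
        sq_dist = sum((xi - pi) ** 2 for xi, pi in zip(x, p))
        total[c] += math.exp(-(b ** 2) * sq_dist)
        count[c] += 1
    if average:
        # Alternative PNN: divide by N_i, which discards class proportions.
        return {c: total[c] / count[c] if count[c] else 0.0 for c in classes}
    return total


def convex_normalize(d):
    """Rescale nonnegative discriminants so that they sum to 1."""
    s = sum(d.values())   # may underflow to 0.0 for large b and distant x
    return {c: v / s for c, v in d.items()}


# Three nonpreseizure examples and one preseizure example at the same point.
patterns = [(0.0,), (0.0,), (0.0,), (0.0,)]
labels = ["NS", "NS", "NS", "S"]
x = (0.0,)

first = convex_normalize(pnn_discriminants(x, patterns, labels, ("S", "NS"), b=1.0))
alt = convex_normalize(
    pnn_discriminants(x, patterns, labels, ("S", "NS"), b=1.0, average=True))
print(first)   # reflects the 1:3 class proportions
print(alt)     # proportions averaged away
```

The sum-based variant carries the training proportions into its normalized output, whereas the averaged variant does not, consistent with the "blindness" to prior probabilities noted above.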

Corrections to classifiers for optimal decision rules. Since seizures are relatively
infrequent, preseizure feature vectors will exist with very low a priori probability $P(S_T)$ at
the most useful T's. Training and/or testing classifiers/predictors with wrong prior
probabilities induces a variety of distortions that, if not corrected, lead to biased
assessments of suboptimal systems. For example, a system trained in the laboratory using
the distorted prior probabilities implicit in the IEEG archives will not yield the highest
achievable accuracy, or may not yield the lowest achievable risk with the fixed weights
originally intended, etc. In addition, if the system is tested in the laboratory also with
wrong prior probabilities, some of the metrics will look better (or worse) than they will
actually be in real life on the patient's device. It is known that correction factors can be
applied to the discriminant functions of maximum a posteriori neural classifiers, after
training, so that systems will perform as though they were optimized with the true prior
probabilities. This correction technique can be extended to other types of decision rules.
Corrections can also be derived for thresholds, and can be used to change the optimality
definition of a given classifier. In addition, corrections can be derived for the posterior
probabilities as explained in a subsequent section. The consequences of ignoring these
corrections and much of the following methods are not well known judging by the pattern
recognition literature, where in fact, the mistake is routinely found.
The a priori probability of preseizure vectors estimated from the proportion of
preseizure vector examples in the training data set,

$$P^{TRN}(S_T) = \frac{N_{S_T}}{N_{obs}},$$

will typically not reflect the true frequency of occurrence $P^{true}(S_T)$ in the continuous time
line (seizures are accelerated during hospitalization, there are time gaps in the recordings,
etc.). The true prior probability is estimated over patient monitoring periods longer than
the hospitalization and under steady-state drug, hydration, sleep, and other conditions
(though not longer than about a month, when long-term behavior can be considered non-
stationary). From Bayes' numerators in the error and risk decision rules, it can be seen
that this will cause a mismatch between an optimal classifier learned under the apparent
proportions, and what the optimal classifier would have been for the true environment
(the correct ROC operating points will be different). This issue must be carefully
addressed, particularly when the prior probability for one class is much smaller than for
the others. In an attempt to incorporate the true proportions, it is impractical and
counterproductive to train a classifier with millions of $NS_T$ vectors and only a few $S_T$
ones. Training a classifier with such an unbalanced proportion of examples would
obscure the very patterns it must pay attention to. The correction methods presented here
simultaneously account for the prior probability mismatch after training has taken place,
and allow the practitioner to use any convenient proportion of examples for training
classifiers or probability estimators.

Three types of classifiers are defined based on the type of discriminant functions
$D_i(x)$ they construct. Type-B classifiers have an internal representation of the shape of
the Bayes numerator for each class, that is, the $D_i(x)$ equal $p(x|C_i)P(C_i)$ or are
proportional to them by a common factor across classes. Equivalent classifiers that
monotonically transform the Bayes numerator score such as with the negative or the
logarithm, from which the score can be recovered, would still qualify as type-B. Bayes
quadratic classifiers for jointly Gaussian features and PNNs in the limit $N_{obs} \to \infty$ are
examples of type-B classifiers. Type-L classifiers have an internal representation of the
shape of the likelihood score for each class, that is, the $D_i(x)$ equal the PDFs $p(x|C_i)$ or
are proportional to them by a common factor across classes. A maximum selector of area-
normalized histograms and the alternative PNN in the limit $N_{obs} \to \infty$ are examples of
type-L classifiers. Type-O are all other classifiers which can yield optimal decision
rules but do not attempt to approximate discriminant functions related to $p(x|C_i)$ or
$p(x|C_i)P(C_i)$. Fuzzy classifiers and kNNs are examples of type-O classifiers.

A type-B classifier that was optimized using wrong training prior probabilities
learns the equivalent of the discriminants $D_i^{TRN}(x) = p(x|C_i)P^{TRN}(C_i)$. The decision rules
affected (when implemented as argmax{·}) are min $P_E$, and min $R_E$ and min $R_O$ with risks
originally fixed through risk factors r. Each discriminant can be corrected by rescaling
with correction factors

$$c_i = \frac{P^{true}(C_i)}{P^{TRN}(C_i)},$$

so that applying these numbers to the discriminants

$$D_i^{true}(x) = c_i\, D_i^{TRN}(x)$$

corresponds, within a factor, to canceling the wrong prior probabilities and replacing
them with the true ones:

$$D_i^{true}(x) \propto p(x|C_i)\,P^{true}(C_i).$$
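
The rescaling of type-B discriminants may be illustrated with a minimal sketch. It assumes that each correction factor is the ratio of the true prior to the training prior for the class, followed by convex normalization. The function name and the numerical values are hypothetical.

```python
def correct_discriminants(d_trn, prior_trn, prior_true):
    """Rescale discriminants learned under training priors and renormalize.

    Assumes each correction factor is P_true(C_i) / P_trn(C_i).
    """
    corrected = {c: d_trn[c] * prior_true[c] / prior_trn[c] for c in d_trn}
    total = sum(corrected.values())
    return {c: v / total for c, v in corrected.items()}


# A classifier trained on balanced data reports P(S_T|x) = 0.9 for some x.
d_trn = {"S": 0.9, "NS": 0.1}
prior_trn = {"S": 0.5, "NS": 0.5}
prior_true = {"S": 0.005, "NS": 0.995}

print(correct_discriminants(d_trn, prior_trn, prior_true))
```

In this hypothetical case an apparent posterior of 0.9 falls well below 0.5 once the true rarity of the preseizure class is restored, so a minimum-error rule would no longer declare $S_T$ for that observation.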

Since PDFs are independent of prior probabilities, type-L classifiers are not
inherently affected by prior probability mismatches. They are ideally suited (when
implemented as argmax{·}) for the maximum-likelihood decision rule, Neyman-Pearson,
minimax $R_E$, and for min $R_E$ and min $R_O$ with risks originally fixed through penalty
factors $\pi$. The same multiplicative correction technique can be used for changing the
optimality definition of a given classifier. For example, if we are given a maximum-
likelihood type-L classifier, which minimizes $0.5P_{FN} + 0.5P_{FP}$, and later decide it should
have been $R_E = 0.75P_{FN} + 0.25P_{FP}$, then we only need to reemphasize the discriminant
function for class $S_T$ with

$$D_{S_T}^{new}(x) = D_{S_T}(x)\,\frac{0.75}{0.5},$$

while similarly deemphasizing the complement class

$$D_{NS_T}^{new}(x) = D_{NS_T}(x)\,\frac{0.25}{0.5}.$$

Of course, type-L classifiers would be affected by prior probability mismatches if they
are used to implement the decision rules mentioned above for type-B classifiers by
substitution into the formula $D_i^{TRN}(x) = p(x|C_i)P^{TRN}(C_i)$. It should be noted that all
classifiers/decision rules are affected by prior probability mismatches when they are
implemented as a threshold on the posterior probability $P(C_i|x)$ as in the presently
preferred online implementation. The online implementation of direct estimates of PDFs
by type-L classifiers is considered infeasible beyond about five feature dimensions.

Type-O discriminant functions can have arbitrary shapes and give optimal
classifiers tuned for the apparent prior probabilities $P^{TRN}(C_i)$, as long as their
intersections produce the same decision boundaries or thresholds as those produced by
$p(x|C_i)$ or $p(x|C_i)P^{TRN}(C_i)$. However, because they are not required to approximate the
shape of these likelihood scores or Bayes numerators, there is no guarantee that the above
correction techniques will result in type-O classifiers that behave optimally under real-
life conditions.

In the kNN, even though its "popularity" score discriminants do not have the
shape of $p(x|C_i)P^{TRN}(C_i)$, the correction factors are directly multiplied to those scores
prior to entering a competitive layer. In the PNNs, the correction factors are multiplied to
the discriminant scores for each class prior to convex normalization and processing by a
competitive layer. In our research, kNNs have demonstrated the property that they can be
very robustly corrected for prior probability mismatches, though only partially, using the
methods outlined here. This may be related to the fact that if a class is rare, then it is
inherently less likely to be represented in a nearest neighbor. As types B and L, PNNs can
be fully corrected for prior probabilities or new optimality, though the practitioner is
warned that the sensitivity on the b parameter can make this difficult. The effect of an
inadequate b can outweigh the classifier's rectifiability to the point that no correction is
achieved.

Feature subset selection. In building a rich feature library based on a
combination of intuition, brainstorming, knowledge of the field, and trial-and-error, no
effort is (or should be) made to ensure that the N_f features are uncorrelated. Some of the
features may even turn out to be completely irrelevant for the particular prediction task.
Furthermore, the use of all features in this library is associated with a large computational
burden on the learning and execution process. Therefore, the smallest feature subset
(n-vector) that meets a performance objective must be found. Cover has constructed an
example with binary features where the best feature subset was actually the
"worst-looking" according to any greedy technique, because independent observations (e.g.,
through time) of a "bad" feature can yield better accuracy than independent observations
of a "good" one. For a feature library of size N_f ≤ 30 and feature dimensions n ≤ 5, this
selection problem can be solved by exhaustive search (≤ 174,436 evaluations); however,
in general, heuristic searches are required despite Cover's paradox. For a fixed n, there
are

$$\binom{N_f}{n} = \frac{N_f!}{n!\,(N_f - n)!}$$

possible ways of choosing n-dimensional feature vectors out of the N_f features, n ≤ N_f,
which adds up to 2^{N_f} − 1 feature vectors of all possible sizes in the power set of the
feature library. These numbers can grow so large as to preclude the possibility of
exhaustive search. For example, $\binom{30}{10}$, $\binom{100}{5}$, and $\binom{100}{10}$
are on the order of 30(10)^6, 75(10)^6, and 1.7(10)^13, respectively.
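By way of illustration only, the subset counts discussed above can be checked with a few lines of Python. The sketch assumes that the three quoted magnitudes correspond to choosing 10 of 30, 5 of 100, and 10 of 100 features, which is what the figures suggest; the function names are hypothetical.

```python
from math import comb


def subsets_up_to(n_lib: int, max_dim: int) -> int:
    """Count feature subsets of size 1..max_dim drawn from n_lib features."""
    return sum(comb(n_lib, n) for n in range(1, max_dim + 1))


def power_set_size(n_lib: int) -> int:
    """Count all non-empty feature subsets of a library of n_lib features."""
    return 2 ** n_lib - 1


# Exhaustive-search budget for a 30-feature library and at most 5 dimensions.
print(subsets_up_to(30, 5))

# Assumed (library size, dimension) pairs behind the quoted orders of magnitude.
for n_lib, n in [(30, 10), (100, 5), (100, 10)]:
    print(n_lib, n, f"{comb(n_lib, n):.2e}")
```

The first count covers every subset of one to five features, which is why it stays tractable while the fixed-size counts for larger n do not.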

One of the simplest strategies to deal with this exponential explosion is the
forward sequential search. Each of the N_f features is first individually scored. The best
one is picked and made a permanent part of what will become a "growing" feature vector.
Among the still unused N_f − 1 features, the feature that works best in conjunction with the
first one is found and added to the feature vector. Then among the still unused N_f − 2
features, the feature that works best in conjunction with the previous pair is found and
added to the feature vector. The process is iterated until n features have been chosen
(n being fixed in advance, or until scores reach a desired level). This technique is a greedy
algorithm (the best 1-vector plus another may not be the globally best 2-vector, and so on),
but requires only nN_f − n(n−1)/2 score evaluations. The numbers in the previous example
dramatically drop to 255, 490, and 955, respectively. It has been empirically found in the
prior art that the difference in classification performance between forward selection and
dynamic programming (a much more exhaustive method) is typically 4% or less. However, the
candidate feature vectors generated by this method are nested sets.
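The forward sequential search described above can be sketched as follows. This is a minimal illustration, not the implementation of the invention: the `score` callable stands in for whatever classifier-based performance metric is used, and all names are hypothetical. The evaluation counter makes the nN_f − n(n−1)/2 cost explicit.

```python
from typing import Callable, List, Sequence, Tuple

Score = Callable[[Sequence[int]], float]


def forward_selection(n_lib: int, n: int, score: Score) -> Tuple[List[int], int]:
    """Greedy forward search; returns the chosen indices and the number of score calls."""
    chosen: List[int] = []
    evaluations = 0
    for _ in range(n):
        best_idx, best_val = -1, float("-inf")
        for j in range(n_lib):
            if j in chosen:
                continue
            val = score(chosen + [j])   # score the current vector grown by feature j
            evaluations += 1
            if val > best_val:
                best_idx, best_val = j, val
        chosen.append(best_idx)         # the winner becomes a permanent member
    return chosen, evaluations


def forward_cost(n_lib: int, n: int) -> int:
    """Closed-form number of score evaluations: n*N_f - n*(n-1)/2."""
    return n * n_lib - n * (n - 1) // 2


# Toy additive score (an assumption for the demo): each feature has a fixed merit.
merits = [((7 * i) % 31) / 31 for i in range(30)]
chosen, evaluations = forward_selection(30, 10, lambda s: sum(merits[i] for i in s))
print(chosen, evaluations, forward_cost(30, 10))
```

With an additive toy score the greedy result is also the global optimum; the point of Cover's example is that real scores need not behave this way.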

To reduce the probability that randomly chosen PDFs can defeat forward (or backward)
selection, Stearns' (N_add, N_ko) add-on-knock-out algorithm (originally called "plus m,
take away n") is the preferred computationally tractable, heuristic search method for
feature selection. The best N_add (e.g., 2) features are first added one-by-one as in
forward selection, but on the next round, the worst N_ko (e.g., 1) are eliminated as in
backward selection. This alternating process is iterated until n features are obtained. The
method produces a sequence of candidate feature vectors that are not necessarily nested,
and typically involves about five times the computational effort of forward sequential
search. This is still well below the typical increase required by dynamic programming.
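A minimal sketch of the alternating add and knock-out rounds is given below for illustration. It assumes a generic `score` callable over index lists, and it interprets the "worst" feature as the one whose removal leaves the best-scoring remainder; both are assumptions, and the function name is hypothetical.

```python
from typing import Callable, List, Sequence

Score = Callable[[Sequence[int]], float]


def add_on_knock_out(n_lib: int, n: int, score: Score,
                     n_add: int = 2, n_ko: int = 1) -> List[int]:
    """Alternate n_add greedy additions with n_ko greedy removals until n features are held."""
    if not (0 < n <= n_lib and 0 <= n_ko < n_add):
        raise ValueError("need 0 < n <= n_lib and 0 <= n_ko < n_add")
    chosen: List[int] = []
    while True:
        # Add phase: append the unused feature that works best with the current vector.
        for _ in range(n_add):
            unused = [j for j in range(n_lib) if j not in chosen]
            chosen.append(max(unused, key=lambda j: score(chosen + [j])))
            if len(chosen) == n:
                return chosen
        # Knock-out phase: drop the feature whose removal hurts the score least.
        for _ in range(n_ko):
            if len(chosen) > 1:
                drop = max(chosen, key=lambda j: score([i for i in chosen if i != j]))
                chosen.remove(drop)


# Toy score with one redundant pair (features 1 and 3), purely for demonstration.
merits = [0.2, 0.9, 0.1, 0.8, 0.5, 0.7]
toy = lambda s: sum(merits[i] for i in s) - (0.6 if {1, 3} <= set(s) else 0.0)
print(add_on_knock_out(6, 3, toy))
```

Because a feature can be removed after it was added, successive candidate vectors are not forced to be nested, which is the property the text relies on.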

Artificial features. The specification of distinguishable features is the most
important key to intelligent sensing. Given a set of features, it is known how to create
optimal classifiers. The reverse problem, specifying a set of features given the classifier,
has been limited in the art to feature subset selection as described in the previous section.
The present invention optionally further optimizes the selected feature vector using a
computationally intensive procedure to create a genetically found, neurally computed
(GFNC) artificial feature which, by definition, matches or exceeds the performance of the
original feature vector. For fixed classifier computational resources, the discrimination
task is greatly facilitated with a single GFNC input feature compared to multidimensional
inputs carrying the same, but more "hidden" information.

In this framework, features are represented and computed by networks. Since a
feature, as defined in the present invention, is obtained from a formula or algorithm that
maps a raw input set into a scalar, a suitable neural network is capable of learning
and implementing the mapping. Therefore, GFNC features are obtained from the outputs
of feedforward networks, or the stable equilibria of recurrent networks, and can mimic
conventional features or be completely novel artificial features. It is envisioned that with
future increases in computational capabilities, artificial features will be abstracted
directly from raw data, which are thus not limited to information generated from a finite
list of features. The networks are represented genotypically as binary strings
(chromosomes) and are considered as individuals in a genetic algorithm (GA) or other
evolutionary algorithm. Table 4 highlights the contrasting characteristics between
conventional and artificial features.



Table 4

Conventional Features      Artificial Features
---------------------      -------------------
sequential                 parallel
Von Neumann computer       neural computer
programmed                 learned
combinatorial              inductive
ad-hoc                     optimized
based on intuition         based on data



As an example of how a GFNC feature can be created, consider the problem of
deciding whether two random vectors are parallel in the plane. Given the starting points
and increments of the two vectors, (x_1, y_1, Δx_1, Δy_1) and (x_2, y_2, Δx_2, Δy_2), it is
desired that a decision structure output 1 for parallel, and 0 for non-parallel. It will be
instructive to note that the starting points and the relative size of the increments are
irrelevant, and from knowledge of analytic geometry, that the absolute value of the cosine
between the vectors:

$$|\cos\theta| = \frac{|\Delta x_1 \Delta x_2 + \Delta y_1 \Delta y_2|}{\sqrt{\Delta x_1^2 + \Delta y_1^2}\,\sqrt{\Delta x_2^2 + \Delta y_2^2}}$$

is an optimal feature for this task, with 0 meaning orthogonal, and 1 meaning completely
parallel. Sets of random training and testing input vectors
[x_1, y_1, Δx_1, Δy_1, x_2, y_2, Δx_2, Δy_2] containing parallel and non-parallel classes are
generated. An artificial feature is implemented by an algebraic network, with topological
connection symmetries, with single-input nodes that represent one of the unary mathematical
operators {I(·), (·)^2, √(·), |·|}, where I(·) is the identity operator, and with two-input
nodes that represent binary operators {+, −, *, /}. The latter type of nodes can be easily
generalized to n-ary operators. Conventional neural networks can also be employed but would
be much larger. The output of the network is a rather general nonlinear feature of the
original inputs. Raw training and testing patterns go through the network in order to create
a single-input kNN classifier that is trained on the artificial feature, and to compute the
Pc accuracy metric on the test set.
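For illustration, the analytic reference feature of this example (the absolute cosine of the angle between the two increment vectors) and a generator of labeled training patterns can be sketched as below. The input ordering, the sampling ranges, and the function names are assumptions made for the sketch only.

```python
import math
import random


def abs_cosine(dx1: float, dy1: float, dx2: float, dy2: float) -> float:
    """Absolute cosine of the angle between increments (dx1, dy1) and (dx2, dy2)."""
    dot = dx1 * dx2 + dy1 * dy2
    return abs(dot) / (math.hypot(dx1, dy1) * math.hypot(dx2, dy2))


def make_example(parallel: bool, rng: random.Random):
    """Return ([x1, y1, dx1, dy1, x2, y2, dx2, dy2], label) with label 1 for parallel."""
    x1, y1, x2, y2 = (rng.uniform(-1.0, 1.0) for _ in range(4))  # irrelevant inputs
    dx1, dy1 = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    if parallel:
        k = rng.choice([-1.0, 1.0]) * rng.uniform(0.1, 2.0)      # arbitrary relative size
        dx2, dy2 = k * dx1, k * dy1
    else:
        dx2, dy2 = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    return [x1, y1, dx1, dy1, x2, y2, dx2, dy2], int(parallel)


rng = random.Random(0)
inputs, label = make_example(True, rng)
print(label, abs_cosine(inputs[2], inputs[3], inputs[6], inputs[7]))
```

The starting points never enter `abs_cosine`, which mirrors the observation that they are irrelevant to the decision.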

The genetic algorithm (GA) uses the Pc metric as a fitness function. Each
candidate solution is an artificial feature/formula. In the GA's chromosome for each
individual, the node operations can be represented using two binary genes (because there
are 2^2 possible operators per node), for a total of 44 genes per individual. This
chromosome is decoded into the network that represents an artificial feature. The
evolution of artificial features in the population progresses by genetic crossover and
mutation. For a network with 30 nodes, with four possible operators per node, and
allowing the topological connections of all inputs to vary as {0,0} (no connection), {0,1},
{1,0}, and {1,1} (two connections), the problem is a combinatorial search in a
38-dimensional space containing 4^38 ≈ 7.6(10)^22 candidate solutions. The age of planet
Earth is estimated at 3.5(10)^9 years. Therefore, if it took only one second to carry out each
fitness function evaluation, it would take on the order of 700,000 lives of the Earth to
fully enumerate the solution space and guarantee optimality.
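The order-of-magnitude estimate above can be reproduced with the short calculation below. It takes the search space as 4^38 candidates and the age of the Earth as 3.5(10)^9 years, as the passage appears to state, and assumes a Julian year of 365.25 days for the conversion.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600   # Julian year, an assumption for this estimate
EARTH_AGE_YEARS = 3.5e9                 # figure quoted in the text

candidates = 4 ** 38                    # four choices in each of 38 dimensions
years = candidates / SECONDS_PER_YEAR   # at one fitness evaluation per second
earth_lifetimes = years / EARTH_AGE_YEARS

print(f"{candidates:.2e} candidates, about {earth_lifetimes:,.0f} lifetimes of the Earth")
```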

In experiments with the above problem, using relevant inputs, a feature never
thought of by a domain expert was quickly found to achieve the same zero-error
performance as the common-sense analytic feature. When mixing relevant and irrelevant
inputs, a significant increase in accuracy (from 55% to 84.2%) was possible by evolving
the artificial feature from the raw set of eight inputs within computational resource and
time constraints (a few hours in a personal computer). Three of the four delta features
(Δx_1, Δx_2, and Δy_2) remained connected, while three of the four starting-point features
(x_1, x_2, and y_1) were disconnected. This suggests that the GFNC procedure was attempting
to create features that pay more attention to the relevant features than to the irrelevant
ones. After pruning disconnected nodes, the analytic form of the resulting artificial
feature is

This example illustrates how GFNC artificial features can provide viable solutions in a
prohibitively large space, in fact, much larger than the power set of features presented by
the original feature library (255 feature subsets in this example).

Probability estimation. The synthesis of realistic posterior probability functions
P(S_T|x) is among the key elements of the present invention. In addition to class
discrimination, the internal discriminant functions of nonparametric classifiers can also
be used for estimation of posterior probabilities. But because these classifiers store all
training data to directly estimate class-conditional PDFs, they suffer badly from the
"curse of dimensionality" and do not offer much over multidimensional histograms in
this respect. Their use (especially online) for feature vectors with more than about five
dimensions becomes increasingly questionable, when good PDF estimation requires
on the order of 100^5 = 10,000,000,000 training tokens. According to the present
invention, the function P(S_T|x) is neurally learned from data even though the desired
target probabilities are unknown, following the method of training a generalizing neural
network with a logistic sigmoid output node (to facilitate learning of the [0,1] range),
binary {0,1} target outputs y to indicate 'NS_T' and 'S_T' classes respectively (instead of
the actual probabilities), and a squared error loss function such as those based on the
mean squared error:

$$\mathrm{MSE} = \frac{1}{N_{obs}} \sum_{i=1}^{N_{obs}} (y_i - \hat{y}_i)^2$$

where y_i is a desired target (either 0 or 1) and ŷ_i is the output of the network (continuous
between 0 and 1) for input x_i. The binary-target and squared-error conditions cause a
correctly trained network to output the mathematical expectation of the {0,1} target
output values, given x. That is, the output of the network is, ideally, the expected value of
the binary random variable y|x,

$$\hat{y} = E\{y \mid x\} = 0 \cdot P(y = 0 \mid x) + 1 \cdot P(y = 1 \mid x),$$

which precisely coincides with the desired quantity P(S_T|x). This happens because the
minimizer of a squared error loss function is the conditional mean function (here E{y|x}).
Simply put, in the "battle" between contradictory 0 and 1 examples that are presented to
the network for a given x, the network settles at the average value because that is what
minimizes the squared error at x. This average includes the prior probability information
based on the relative proportions of the 0s and 1s. In practice, there may be just one
input-output example (x_i, y_i), not multiple examples, for a given x. This is where the
generalizing ability of a neural network comes crucially into play. Although the effects of
model approximation and estimation errors in this setting are not well understood, this
approach has been immensely successful in practice.
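The claim that a squared-error fit to binary targets settles at the class proportion can be checked numerically. The sketch below is only a demonstration under simplifying assumptions: it considers a single input x with several contradictory 0 and 1 targets and searches a grid of constant outputs, rather than training a network.

```python
from typing import List


def mean_squared_error(output: float, targets: List[int]) -> float:
    """Average squared error of one constant output against binary targets."""
    return sum((y - output) ** 2 for y in targets) / len(targets)


def best_constant(targets: List[int], steps: int = 1000) -> float:
    """Grid-search the output in [0, 1] that minimizes the squared error."""
    grid = [k / steps for k in range(steps + 1)]
    return min(grid, key=lambda c: mean_squared_error(c, targets))


# Three preseizure (1) and seven non-preseizure (0) examples seen at the same x.
targets = [1] * 3 + [0] * 7
print(best_constant(targets), sum(targets) / len(targets))
```

The minimizer equals the fraction of 1s among the targets, which is the empirical estimate of P(y = 1 | x) at that input.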
Wavelet Neural Networks. Wavelet neural networks (WNNs) are used in the
present invention to learn the probability function P(S_T|x) in up to about 10 dimensions
with moderate amounts of training samples. By having wavelet nodes act like
"templates" that match the structure of many functions of practical interest, WNNs offer
efficient solutions to regression and concept learning problems in general. WNNs can
optimally encode information about one-dimensional functions. Multidimensional affine
and radial wavelet networks are dense in large spaces, can attain a sup-norm
approximation rate that is independent of input dimension, and their number of
parameters grows linearly with both dimension and number of nodes, in contrast to the
exponential number needed in traditional polynomial, spline, and trigonometric
expansions. In addition, WNNs can be initialized in ways that significantly speed up
training. In real-world applications ranging from EEG analysis to financial engineering,
WNNs have shown a tendency to yield efficient solutions with fewer parameters than
alternative network architectures for a given level of accuracy.

The basic implementation of a T-minute WNN probability model is a multiple-input,
single-output transformation:

$$P(S_T \mid x) \approx \hat{y} = \frac{1}{1 + e^{-u}},$$

$$u = \sum_{m=1}^{M} c_m\, \psi_{A_m, b_m}(x) + c_1^{lin} x_1 + \cdots + c_n^{lin} x_n + c_0^{lin},$$

$$\psi_{A_m, b_m}(x) = \psi\!\left(\sqrt{(x - b_m)\, A_m\, (x - b_m)^T}\right),$$

$$\psi(x) = \min\{\max\{\tfrac{3}{2}(1 - |x|),\, 0\},\, 1\}\, \cos(\tfrac{3}{2}\pi x),$$

where x is presented as a row vector of input features [x_1 ··· x_n], b_m is a translation vector
associated with the m-th wavelet node, A_m is a symmetric positive semi-definite
"squashing" matrix, M is the number of wavelet nodes, c are linear output coefficients
associated with a nonlinear portion of the model, and c^lin are linear output coefficients
associated with a linear portion of the model. The dependence of this WNN on T is
implicit by way of the training data set that is used to tune the network parameters A_m,
b_m, c, and c^lin.
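A forward pass through a model of this kind can be sketched as follows. The sketch rests on assumptions that should be checked against the formulas of the specification: the pre-sigmoid activation is taken to be a weighted sum of radial wavelet nodes plus a linear term and a constant bias, and the mother wavelet is taken to be a cosine with a trapezoidal window using the constant 3/2. All function and parameter names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def mother_wavelet(r: float) -> float:
    """Assumed form: trapezoidal window times a cosine, supported on |r| <= 1."""
    window = min(max(1.5 * (1.0 - abs(r)), 0.0), 1.0)
    return window * math.cos(1.5 * math.pi * r)


def wavelet_node(x: Sequence[float], a: Matrix, b: Sequence[float]) -> float:
    """Radial node: wavelet of sqrt((x - b) A (x - b)^T) for a squashing matrix A."""
    d = [xi - bi for xi, bi in zip(x, b)]
    quad = sum(d[i] * a[i][j] * d[j] for i in range(len(d)) for j in range(len(d)))
    return mother_wavelet(math.sqrt(max(quad, 0.0)))


def wnn_probability(x: Sequence[float], nodes: Sequence[Tuple[Matrix, Sequence[float]]],
                    c: Sequence[float], c_lin: Sequence[float], c0: float) -> float:
    """Logistic sigmoid of the wavelet (nonlinear) part plus the linear part and bias."""
    u = c0 + sum(cl * xi for cl, xi in zip(c_lin, x))
    u += sum(cm * wavelet_node(x, a, b) for cm, (a, b) in zip(c, nodes))
    return 1.0 / (1.0 + math.exp(-u))


# One wavelet node centered at the origin of a two-feature input space.
identity = [[1.0, 0.0], [0.0, 1.0]]
print(wnn_probability([0.1, -0.2], [(identity, [0.0, 0.0])], [2.0], [0.0, 0.0], 0.0))
```

Online evaluation needs only the stored parameters, which is the low-memory property the specification attributes to the WNN.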

The number of wavelet nodes M can be initialized based on K-means clusterings
of the training data in the input-output space for a successively larger number of clusters.
Each clustering is assigned a measure of within- to between-variance of the clusters. The
measure is the inverse of a multidimensional F-ratio,

$$\frac{\sum_{i=1}^{K} \sum_{j=1}^{N_i} \|w_j^i - \bar{w}_i\|^2 \,/\, (N_{obs} - K)}{\sum_{i=1}^{K} N_i\, \|\bar{w}_i - \bar{w}\|^2 \,/\, (K - 1)},$$

where N_obs is the number of observations, K is the number of clusters, w_j^i is an
input-output data point [x y] that belongs to the i-th cluster, N_i is the number of such
points in the i-th cluster, w̄_i is the center of the i-th cluster, and w̄ is the grand mean.
The number of wavelet nodes is taken to be the minimizer of this measure. As a by-product,
initial network parameters are obtained from first and second order statistics of the clusters.
Alternatively, if computational resources allow it, the WNN can be "grown" from its
smallest size to sufficiency based on systematic search and overall performance metrics.
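The within- to between-variance measure can be computed from any clustering of the input-output points. The sketch below assumes that cluster labels are already available from a K-means routine (not shown) and uses hypothetical function names; it illustrates the measure, not the initialization procedure as a whole.

```python
from typing import Dict, List, Sequence


def within_between_measure(points: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Inverse F-ratio: within-cluster variance over between-cluster variance."""
    n_obs, dim = len(points), len(points[0])
    clusters: Dict[int, List[Sequence[float]]] = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    k = len(clusters)
    if k < 2 or k >= n_obs:
        raise ValueError("need 2 <= K < N_obs")
    grand = [sum(p[d] for p in points) / n_obs for d in range(dim)]
    within = between = 0.0
    for members in clusters.values():
        center = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        within += sum(sum((p[d] - center[d]) ** 2 for d in range(dim)) for p in members)
        between += len(members) * sum((center[d] - grand[d]) ** 2 for d in range(dim))
    return (within / (n_obs - k)) / (between / (k - 1))


def pick_num_nodes(measure_by_k: Dict[int, float]) -> int:
    """Choose the number of clusters (wavelet nodes) that minimizes the measure."""
    return min(measure_by_k, key=measure_by_k.get)


points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
print(within_between_measure(points, [0, 0, 1, 1]))   # tight, well-separated clusters
print(within_between_measure(points, [0, 1, 0, 1]))   # poor grouping of the same points
```

A small value indicates compact, well-separated clusters, so scanning K and keeping the minimizer gives a data-driven starting value for M.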
For every hypothesized WNN probability model structure, training of the network
parameters A_m, b_m, c, and c^lin is cast as a minimization problem with respect to the
empirical average squared error function

$$\mathrm{ASE} = \frac{1}{N_{obs}} \sum_{i=1}^{N_{obs}} (y_i - \hat{y}_i)^2.$$

With the option of tapered-off data influence, training the probability estimator becomes
a weighted least squares problem, where the contribution that each squared error makes
to the loss function is weighted by w_i according to the age of the training datum and some
forgetting schedule:

$$\mathrm{ASE} = \frac{1}{N_{obs}} \sum_{i=1}^{N_{obs}} w_i\, (y_i - \hat{y}_i)^2.$$

In this case, the most recent of the last four months has only an incremental effect on the
system, but typically in such a way that it has more influence than the previous months
on the learned probability function.
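A weighted squared-error criterion with tapered-off data influence might be sketched as follows. The exponential schedule, the time constant, and the normalization by the number of observations are assumptions chosen for the illustration; the specification leaves the forgetting schedule open.

```python
import math
from typing import List, Sequence


def forgetting_weights(ages: Sequence[float], time_constant: float) -> List[float]:
    """Hypothetical exponential schedule: weight 1 for new data, decaying with age."""
    return [math.exp(-age / time_constant) for age in ages]


def weighted_ase(targets: Sequence[float], outputs: Sequence[float],
                 weights: Sequence[float]) -> float:
    """Average of squared errors, each scaled by the weight of its training datum."""
    total = sum(w * (y - yh) ** 2 for y, yh, w in zip(targets, outputs, weights))
    return total / len(targets)


# Four training data whose ages are 3, 2, 1, and 0 months.
weights = forgetting_weights([3, 2, 1, 0], time_constant=1.0)
print(weights)
print(weighted_ase([1, 0, 1, 0], [0.8, 0.3, 0.6, 0.1], weights))
```

Setting every weight to one recovers the unweighted average squared error, so the rectangular window is a special case of the same criterion.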

These error criteria are used as a guide during minimization using the training set;
however, care is taken to select a model that attempts to minimize the expected value of
this measure over a validation set representative of future data (not over the particular
training set). Estimates of E{ASE} can be obtained in principle from regularization
techniques, or resampling techniques such as leave-one-out cross-validation and
bootstrap statistics. However, split-sample validation is by far the simplest, yet an
effective practical technique for discouraging overtraining of the network and thus for
attempting to preserve generalization. The data set is randomly split into a training set
TRN (e.g., 70% of all (x_i, y_i) examples) and a validation set VAL (e.g., the remaining
30%). If enough examples are available, a completely independent test set TST can also
be reserved for final assessment of generalization error (future performance). Training
proceeds by minimization of error over TRN while monitoring the error on VAL. The
best WNN on VAL is recorded at every iteration. Typically, the error over TRN drops to
arbitrarily small values (provided a complex enough WNN), while the error over VAL
first decreases and then increases steadily. The final network chosen is the one that
minimizes the error over VAL. Note that minimizing VAL error in this fashion does not
imply memorizing VAL (whose associated error can always be driven to zero). VAL is
ideally a representative sample of the universe of all future examples. There is a bias
introduced by this scheme to the extent that VAL deviates from this ideal. The preferred
minimization algorithms for training the WNN are based on a multi-start
Levenberg-Marquardt optimization, a genetic algorithm of ranking type, or a combined
genetic algorithm global search followed by Levenberg-Marquardt fine tuning. These
optimization algorithms and combination schemes are well known in the art.
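The split-sample procedure can be sketched with a deliberately small stand-in model. The example below substitutes a one-input logistic unit trained by plain gradient descent for the WNN and for the Levenberg-Marquardt or genetic optimizers named above; only the 70/30 split and the bookkeeping of the best parameters on VAL are meant to be illustrative, and all names are hypothetical.

```python
import math
import random
from typing import List, Tuple

Example = Tuple[float, int]


def sigmoid(u: float) -> float:
    return 1.0 / (1.0 + math.exp(-u))


def ase(params: Tuple[float, float], data: List[Example]) -> float:
    """Average squared error of the logistic unit sigmoid(w * x + b) on a data set."""
    w, b = params
    return sum((y - sigmoid(w * x + b)) ** 2 for x, y in data) / len(data)


def train_with_validation(data: List[Example], n_iter: int = 200,
                          lr: float = 0.5, seed: int = 0):
    """Minimize error on TRN while recording the parameters that are best on VAL."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(0.7 * len(shuffled))
    trn, val = shuffled[:cut], shuffled[cut:]
    w, b = 0.0, 0.0
    best = ((w, b), ase((w, b), val))
    for _ in range(n_iter):
        gw = gb = 0.0
        for x, y in trn:                      # gradient of the squared error on TRN
            p = sigmoid(w * x + b)
            g = -2.0 * (y - p) * p * (1.0 - p)
            gw += g * x
            gb += g
        w -= lr * gw / len(trn)
        b -= lr * gb / len(trn)
        err = ase((w, b), val)
        if err < best[1]:                     # keep the network that is best on VAL
            best = ((w, b), err)
    return best, trn, val


rng = random.Random(1)
data = []
for _ in range(100):
    x = rng.uniform(-3.0, 3.0)
    data.append((x, 1 if rng.random() < sigmoid(2.0 * x) else 0))
(best_params, best_err), trn, val = train_with_validation(data)
print(best_params, best_err)
```

The returned parameters are those with the lowest VAL error seen during training, not necessarily those of the last iteration.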

Coding the continuous function P(S_T|x) as a WNN satisfies low memory and fast
throughput requirements for real-time, online operation at the expense of relatively
difficult offline training. For low feature dimensionality (< 5), the function can also be
approximated by kNNs or PNNs for fast offline training at the expense of large online
memory and slow throughput. Hybrid architectures such as PNN truncated by nearest
neighbors, or vector-quantized PNNs are also possible (although the latter is the same as
an initialized, untrained radial basis function neural network). With minimal resources,
the continuous function can ultimately be discretized and implemented in a binary
decision tree (a look-up table with quantized inputs).

WO 02/058536    PCT/US01/50046

Corrections to probability estimators for prior probability mismatches. The
posterior probability estimates learned from data sets that do not reflect the true prior
probability of the classes (e.g., from IEEG archives or clinical trials) will be distorted.
Correcting posterior probability estimates obtained from type-L discriminant functions
involves simply plugging the PDFs into Bayes formula with the correct prior
probabilities. Correcting estimates obtained from type-B discriminants can be
accomplished with correction factors prior to convex normalization:

$$D_i(x)\, \frac{P^{true}(C_i)}{P^{TRN}(C_i)}.$$

This corresponds, within a factor, to

$$p(x \mid C_i)\, P^{TRN}(C_i)\, \frac{P^{true}(C_i)}{P^{TRN}(C_i)}.$$

The order of correction and convex normalization operations can be switched, but then a
second and final convex normalization is required to yield the above result. As previously
explained, computational requirements in high dimensions cast serious doubts on the
results obtained from the above type-L and B discriminants for probability estimation.
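For illustration, a correction of class scores for a prior probability mismatch followed by convex normalization can be sketched as below. The sketch assumes that each score is proportional to the class-conditional density times the training prior, and that both the training priors and the true priors are known; the function name is hypothetical.

```python
from typing import List, Sequence


def correct_posteriors(scores: Sequence[float], train_priors: Sequence[float],
                       true_priors: Sequence[float]) -> List[float]:
    """Scale each class score by true prior / training prior, then normalize to sum 1."""
    scaled = [s * t / r for s, r, t in zip(scores, train_priors, true_priors)]
    total = sum(scaled)
    return [s / total for s in scaled]


# Posteriors learned under balanced training, corrected to a rare-event condition.
print(correct_posteriors([2.0 / 3.0, 1.0 / 3.0], [0.5, 0.5], [0.01, 0.99]))
```

Because the final step is a normalization, it does not matter whether the incoming scores were already normalized, which matches the remark that the two operations can be switched if a final normalization is applied.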

Corrections for online probability estimators are now derived. Bayes' theorem can
be written in a form that is compatible with a logistic sigmoid unit

$$\hat{y} = \frac{1}{1 + e^{-u}}$$

at the output of a neural network. The function that the network must indirectly learn
from a training data set is

$$P^{TRN}(S_T \mid x) = \frac{p(x \mid S_T)\, P^{TRN}(S_T)}{p(x \mid S_T)\, P^{TRN}(S_T) + p(x \mid NS_T)\, P^{TRN}(NS_T)}.$$

Dividing the Bayes numerator and denominator by the numerator, and using the laws of
logarithms, we obtain

$$P^{TRN}(S_T \mid x) = \frac{1}{1 + \dfrac{p(x \mid NS_T)\, P^{TRN}(NS_T)}{p(x \mid S_T)\, P^{TRN}(S_T)}}
= \frac{1}{1 + \exp\!\left(-\left[\ln\dfrac{p(x \mid S_T)}{p(x \mid NS_T)} + \ln\dfrac{P^{TRN}(S_T)}{P^{TRN}(NS_T)}\right]\right)}.$$

It can be seen that the neural network's job prior to the logistic sigmoid output unit is to
approximate the term between square brackets: the log-likelihood ratio function
(independent of prior probabilities) plus a bias term (dependent on prior probabilities).
Therefore, to correct P^TRN(S_T|x) after having already trained the neural network, all that is
necessary is to go inside the network and replace the constant bias term with the correct
one. For the WNN, this is

$$c_0^{lin} \leftarrow c_0^{lin} - \ln\frac{P^{TRN}(S_T)}{P^{TRN}(NS_T)} + \ln\frac{P^{true}(S_T)}{P^{true}(NS_T)}.$$

Even if the neural network is given as a black box, if P^TRN(S_T) is known, then the ratio of
posterior probabilities of the two classes can be used to recover the likelihood ratio:

$$\frac{p(x \mid S_T)}{p(x \mid NS_T)} = \frac{P^{TRN}(S_T \mid x)\, P^{TRN}(NS_T)}{P^{TRN}(NS_T \mid x)\, P^{TRN}(S_T)},$$

which can then be plugged into Bayes formula to compute the corrected probability as

$$P^{true}(S_T \mid x) = \frac{1}{1 + \dfrac{p(x \mid NS_T)}{p(x \mid S_T)} \cdot \dfrac{P^{true}(NS_T)}{P^{true}(S_T)}}.$$
An important additional advantage of these correction methods is that training can
be done purposely under a balanced (50%-50%) or any other convenient proportion (as
they come from the clippings of the IEEG archives), and still obtain the probability
estimates for the unbalanced, real-life condition. An increased variance in the unbiased
estimate can be expected, of course, just like in any other estimation problem with a
reduced sample size.
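The two correction routes for a network with a logistic sigmoid output can be sketched as follows: replacing the constant bias inside the network, or correcting the output of a black-box network through the likelihood ratio. The sketch assumes a two-class problem in which the training prior and the true prior of the preseizure class are known; the function names are hypothetical.

```python
import math


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


def corrected_bias(bias: float, train_prior: float, true_prior: float) -> float:
    """Swap the training-prior log-odds in the bias term for the true-prior log-odds."""
    return bias - logit(train_prior) + logit(true_prior)


def correct_black_box(p_trn: float, train_prior: float, true_prior: float) -> float:
    """Correct a posterior learned under train_prior so that it reflects true_prior."""
    num = p_trn * true_prior / train_prior
    den = num + (1.0 - p_trn) * (1.0 - true_prior) / (1.0 - train_prior)
    return num / den


# A network trained on balanced data reports 2/3; seizures are in fact rare (1%).
p = correct_black_box(2.0 / 3.0, train_prior=0.5, true_prior=0.01)
u = corrected_bias(logit(2.0 / 3.0), train_prior=0.5, true_prior=0.01)
print(p, 1.0 / (1.0 + math.exp(-u)))
```

Both routes give the same corrected probability, since shifting the pre-sigmoid activation by the difference in prior log-odds is algebraically the same as re-applying Bayes formula to the recovered likelihood ratio.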

The present invention is realized in a combination of hardware and software. Any
kind of computer system or other apparatus adapted for carrying out the methods
described herein is suited. A typical combination of hardware and software could be a
general purpose computer system with a computer program that, when loaded and
executed, controls the computer system such that it carries out the methods described
herein. The present invention can also be embedded in a computer program product
which includes all the features enabling the implementation of the methods described
herein, and which, when loaded in a computer system, is able to carry out these methods.

Computer program instructions or computer program in the present context means
any expression in any language, code, or notation of a set of instructions intended to
cause a system having an information processing capability to perform a particular
function, either directly or when either or both of the following occur: (a) conversion to
another language, code or notation; (b) reproduction in a different material form.

In light of the above teachings, those skilled in the art will recognize that the
disclosed methods, formulas, algorithms, and embodiments may be replaced, modified,
or adapted without departing from the spirit or essential attributes of the invention.
Therefore, it should be understood that within the scope of the appended claims, this
invention may be practiced otherwise than as exemplified herein.
What is claimed is:

1. A method for automatically predicting and preventing the electrographic onset of
a seizure in an individual, comprising the acts of:

    monitoring a plurality of signals indicative of the activity of the brain of
    the individual;

    extracting a set of features from the signals and forming an optimal feature
    vector;

    synthesizing a probability vector based on the optimal feature vector as an
    estimator of the likelihood of seizure for a plurality of prediction
    time intervals; and

    preventing the electrographic onset of a seizure by the automatic
    application of at least one intervention measure that is
    commensurate with the likelihood of seizure.



2. The method for automatically predicting and preventing the electrographic onset
of a seizure of claim 1 further comprising the act of displaying the probability
vector of oncoming seizures for a plurality of prediction time intervals to indicate
both the time frame when the electrographic onsets are expected to occur and a
degree of confidence in the predictions of electrographic onsets.

3. The method for automatically predicting and preventing the electrographic onset
of a seizure of claim 1 wherein the act of preventing the electrographic onset of
seizures automatically applies the minimally required intervention measure
thereby minimizing associated side effects.

4. The method for automatically predicting and preventing the electrographic onset
of a seizure of claim 3 wherein the minimally required intervention measure starts
with a benign form of therapy and continues with more aggressive therapies as the
probability vector continuously changes with decreasing time to electrographic
onset.

5. The method for automatically predicting and preventing the electrographic onset
of a seizure of claim 3 wherein the minimally required intervention measure
couples minimally invasive benign therapies to long prediction time intervals,
aggressive therapies to short prediction time intervals and gradually varying
moderate forms of therapy to prediction time intervals that are between the long
and short prediction time intervals.

6. The method for automatically predicting and preventing the electrographic onset
of a seizure of claim 5 wherein the minimally required intervention measure
applied matches the tolerance for false positive and false negative prediction
errors for each prediction time interval.
7. The method for automatically predicting and preventing the electrographic onset
of a seizure of claim 1 wherein the intervention measure includes at least one of
cognitive stimulation, sensory stimulation, biofeedback, electrical stimulation and
pharmacological infusion.

8. The method for automatically predicting and preventing the electrographic onset
of a seizure of claim 1 further comprising the act of applying a closed loop
feedback control law and commanding a therapy actuator in order to regulate the
seizure probability vector.

9. The method for automatically predicting and preventing the electrographic onset
of a seizure of claim 8 wherein the applied closed loop feedback control law and
therapy actuator includes any of a proportional, proportional-integral-derivative
(PID), optimal continuous, gain scheduled, multilevel and bang-bang control
strategies to regulate the seizure probability vector as a controlled variable.

10. The method for automatically predicting and preventing the electrographic onset
of a seizure of claim 6 wherein the intervention measure is triggered by
prediction thresholds with high sensitivity and low specificity.
11. The method for automatically predicting and preventing the electrographic onset
of a seizure of claim 1 wherein the intervention measure is delivered in at least
one of a region of onset and a distribution region surrounding the region of onset.

12. The method for automatically predicting and preventing the electrographic onset
of a seizure of claim 1 wherein the intervention measure is delivered in
subcortical regions including at least one of the thalamus, basal ganglia, and other
deep nuclei.

13. The method for automatically predicting and preventing the electrographic onset
of a seizure of claim 1 wherein if the electrographic onset occurs, applying
treatment to either at least one of a general region of onset and deep brain
structures to modulate the behavior of the seizure focus.

14. The method for automatically predicting and preventing the electrographic onset
of a seizure of claim 1 wherein the intervention measure application includes at
least one of:

    rhythmic electrical pacing that changes in frequency, intensity and
    distribution as the probability of a seizure onset reaches and
    exceeds a threshold;

    chaos control pacing;

    random electrical stimulation to interfere with developing coherence in
    activity in a region of, and surrounding, an epileptic focus;

    depolarization or hyperpolarization stimuli to silence or suppress activity
    in actively discharging regions, or regions at risk for seizure
    spread.

15. The method for automatically predicting and preventing the electrographic onset
of a seizure of claim 14 wherein the intervention measure is delivered to a
plurality of electrodes to provide a surround inhibition to prevent a progression of
a seizure precursor.

16. The method for automatically predicting and preventing the electrographic onset
of a seizure of claim 14 wherein the intervention measure is delivered
sequentially in a wave that covers a cortical or subcortical region of tissue so as to
progressively inhibit normal or pathological neuronal function in the covered
region.

17. The method for automatically predicting and preventing the electrographic onset
of a seizure of claim 1 wherein the intervention measure application is an
infusion of a therapeutic chemical agent into a brain region where seizures are
generated, or to which they may spread.

18. The method for automatically predicting and preventing the electrographic onset
of a seizure of claim 17 wherein the chemical agent is delivered in greater
quantity, concentration or spatial distribution as the probability of seizure
increases.

19. The method for automatically predicting and preventing the electrographic onset
of a seizure of claim 17 wherein the intervention measure is applied to at least
one of an epileptic focus, an area surrounding the epileptic focus, a region
involved in an early spread, and a central or deep brain region to modulate seizure
propagation.

20. The method for automatically predicting and preventing the electrographic onset
of a seizure of claim 17 wherein the therapeutic chemical agent is activated by
oxidative stress, and increases in concentration and distribution as the probability
of seizure increases.

21. The method for automatically predicting and preventing the electrographic onset
of a seizure of claim 1 wherein the intervention measure is delivered to central
nerves or blood vessels in a graduated manner as the probability of seizure
increases.

22. The method for automatically predicting and preventing the electrographic onset
of a seizure of claim 1 wherein the intervention measure is a plurality of artificial
neuronal signals delivered to disrupt electrochemical traffic on at least one
neuronal network that includes or communicates with an ictal onset zone.

23 . A method for assessing a quality of life index in an individual subj ect to seizures 
in order to adjust an implanted device to optimize patient-specific feature signals 
and treatment therapies, comprising the acts of: 

accumulating the energy of raw intracranial electroencephalograms 
(lEEG) for the individual over multiple data channels during 
seizures over a fixed time period 
accumulating the energy of a treatment control effort over the multiple 
data channels over all times of activation of the implanted device 
over a fixed time period; 
weighting the accumulated energy of the lEEG and the accumulated 
energy of the control effort by seizure and treatment fectors to 
drtermine a quality for the fixed period of time; and 
determining a quality of life index as a weighted average of a current and 
previous qualities for a plurality of fixed time periods. 



24. The method far assessing the quality of life index of claim 23 wherein the fixed 
time period is a day. 
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25. The method for assessing the quality of life index of claim 24 wherein the quality 
of hfe index is an exponratially-weighted moviag average of the daily quahty 
which is based on a forgetting schedule to taper off the influence of older data. 

26. The method for assessing the quality of life index of claim 25 wherein the 
forgetting schedule is designed so that tbe data is 98% forgotten over a period of 
thirty days. 

27. The method for assessing the quality of life index of claim 26 wherein the qualily 
of life index is calculated recursively using weights that are made to decay within 
four time constants of the natural exponential function. 

28. The method for assessing the quality of life index of claim 23 wherein the seizure
and treatment weights are relative weights that sum to one. 
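
The recursion described in claims 23 to 28 can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the function names, the negative sign convention for quality (more accumulated energy means lower quality), and the equal default weights are all hypothetical. The forgetting factor is chosen so that a day's weight decays by four time constants over thirty days, which leaves exp(-4), about 1.8%, of its influence, i.e. roughly 98% forgotten.

```python
import math


def daily_quality(seizure_energy, treatment_energy, k_seizure=0.5, k_treatment=0.5):
    """Weighted penalty for one day; the relative weights must sum to one."""
    if not math.isclose(k_seizure + k_treatment, 1.0):
        raise ValueError("relative weights must sum to one")
    # Sign convention (assumption): more accumulated energy means lower quality.
    return -(k_seizure * seizure_energy + k_treatment * treatment_energy)


def forgetting_factor(horizon_days=30, time_constants=4.0):
    """Per-day decay such that a weight falls to exp(-time_constants) after horizon_days."""
    return math.exp(-time_constants / horizon_days)


def update_qoli(previous_qoli, quality_today, lam):
    """One recursive step of an exponentially weighted moving average."""
    return lam * previous_qoli + (1.0 - lam) * quality_today
```

Calling `update_qoli` once per day with `lam = forgetting_factor()` keeps only the previous index value in memory, which is the practical appeal of the recursive form.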

29. A method for periodic learning to improve and maintain the performance of a 
device implanted in an individual subject to seizures to provide treatment 
therapies, comprising the acts of: 

assessing a quality of life index that penalizes the intensity, duration and 
frequency of seizures and treatments over a fixed period of time; 

marking a time of unequivocal electrographic onset (UEO) in all recorded
seizures over a previous fixed period of time;

creating learning sets of data based on the UEOs by clipping all the IEEG
epochs immediately preceding seizures and labeling the clipped
epochs as preseizure raw data;

clipping and labeling randomly chosen, non-overlapping data as
non-preseizure or baseline raw data;

generating a time series of all features in a feature library from the
preseizure and nonpreseizure raw data;

searching for an optimal feature vector in a power set of the feature library
to minimize a classifier-based performance metric;

synthesizing a posterior probability estimator for the optimal feature
vector; and

coupling an optimal therapy activation threshold to the probability
estimator.

30. The method for periodic learning of claim 29 further comprising joining the 
preseizure and nonpreseizure raw data sets with the corresponding data sets for
the three preceding fixed periods of time in order that the preseizure and 
nonpreseizure raw data sets have an incremental effect on learning. 

31. The method for periodic learning of claim 30 wherein four preseizure and four
nonpreseizure data sets correspond to a four period rectangular moving window
of learning data.

32. The method for periodic learning of claim 30 wherein four preseizure and four
nonpreseizure data sets correspond to a window that tapers off the last four
periods according to a forgetting schedule.



33. The method for periodic learning of claim 29 further comprising preoptimizing
parameters associated with a time history of signal features and thresholds using 
statistical measures of separability. 



34. The method for periodic learning of claim 33 wherein the time history of signal 
features includes an observation window length, and a displacement between
window slides as a feature sampling period.



35. The method for periodic learning of claim 33 wherein the thresholds include 

thresholds on amplitude, thresholds on duration and thresholds on density count. 



36. The method for periodic learning of claim 33 wherein statistical measures of
linear separability include t-scores, Fisher discriminant ratios and K-factors.
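
Two of the linear separability measures named in claim 36 can be illustrated for a single feature observed under the two classes. This is a sketch under assumptions: the function names are hypothetical, the Fisher discriminant ratio is written in its common two-class form, and the t-score uses the Welch (unequal variance) form, which the claim does not specify. The K-factor is omitted because its definition is not given in this section.

```python
import math
import statistics


def fisher_discriminant_ratio(preseizure, baseline):
    """Squared difference of class means divided by the sum of class variances."""
    m1, m2 = statistics.mean(preseizure), statistics.mean(baseline)
    v1, v2 = statistics.variance(preseizure), statistics.variance(baseline)
    return (m1 - m2) ** 2 / (v1 + v2)


def welch_t_score(preseizure, baseline):
    """Difference of class means scaled by its estimated standard error."""
    m1, m2 = statistics.mean(preseizure), statistics.mean(baseline)
    v1, v2 = statistics.variance(preseizure), statistics.variance(baseline)
    return (m1 - m2) / math.sqrt(v1 / len(preseizure) + v2 / len(baseline))
```

A larger value of either measure suggests that the hypothesized feature parameters separate preseizure from baseline data more cleanly, which is the quantity a one-parameter-at-a-time greedy search would try to increase.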



37. The method for periodic learning of claim 33 wherein statistical measures of non-
linear separability include likelihood scores, probability of errors, error risk and
overall risk.

38. The method for periodic learning of claim 33 further comprising generating a time
series of each feature in the feature library using a short calibration epoch of
preseizure and nonpreseizure data under a hypothesized set of parameters for the
feature.

39. The method for periodic learning of claim 38 further comprising determining a
parameter set that maximizes separability by a greedy algorithm that varies one
parameter at a time.

40. The method for periodic learning of claim 33 wherein the act of searching for an
optimal feature vector further comprises the acts of: 

generating trajectories of candidate feature vectors by time synchronizing
the time series of all features in the feature library;

synthesizing nonparametric classifiers that memorize a training data set to
simulate therapy activation decision rules and accelerate feature
optimization;

correcting a discriminant function of the nonparametric classifiers to
determine the optimal decision rule;

measuring an overall risk over a validation set of data that is not directly
used for synthesizing the nonparametric classifiers; and

determining a next candidate feature vector using a heuristic search
criterion.

41. The method for periodic learning of claim 40 further comprising compressing a
selected feature vector into a single genetically-found, neurally-computed 
artificial feature with equal or better discriminatory properties. 

42. The method for periodic learning of claim 40 further comprising charting rank-
order curves for overall risk for the best n vectors.

43. The method for periodic learning of claim 42 wherein the rank-order curves plot
one minus the overall risk versus n. 

44. The method for periodic learning of claim 40 further comprising assessing an 
expected performance for a next month by calculating an average overall risk
using cross-validation over all available data. 

45. The method for periodic learning of claim 40 further comprising assessing an 
expected performance for a next month by calculating an average overall risk 
using a single independent test set not used during the periodic learning.

46. The method for periodic learning of claim 29 wherein the act of synthesizing a
posterior probability estimator further comprises the acts of:

training a wavelet neural network to calculate posterior probability
estimators;

minimizing an expected value of a squared error loss function on a
validation set of data; and

selecting a wavelet neural network that minimizes the error over the
validation set of data.



47. The method for periodic learning of claim 46 further comprising the act of 
correcting the posterior probability estimator's bias term based on prior 
probability mismatches.



48. The method for periodic learning of claim 46 wherein the wavelet neural network 
includes a logistic sigmoid output unit and calculates posterior probability
estimators by presenting {0, 1} targets to indicate a nonseizure class and a seizure
class, respectively. 
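
For an estimator with a logistic sigmoid output, the bias correction for prior probability mismatches in claim 47 can be sketched as a shift in log-odds. This is a standard identity offered as an illustration, not necessarily the correction used in the described device; the function names and the parameters `prior_train` (seizure-class fraction in the training data) and `prior_true` (assumed true seizure prior) are hypothetical.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def corrected_bias(bias, prior_train, prior_true):
    """Shift the output unit's bias from the training prior to the true prior."""
    return bias + logit(prior_true) - logit(prior_train)


def corrected_posterior(p_train, prior_train, prior_true):
    """Apply the same log-odds shift directly to an already computed posterior."""
    z = logit(p_train) + logit(prior_true) - logit(prior_train)
    return 1.0 / (1.0 + math.exp(-z))
```

For example, an output of 0.5 from a network trained on balanced classes maps back to the true prior itself, since the features then carry no evidence either way.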

49. The method for periodic learning of claim 29 further comprising the act of
ranking the desirability of feature vectors using a plurality of sets of classifier-
based performance metrics.

50. The method for periodic learning of claim 49 wherein a first set of classifier-
based performance metrics includes a probability of a correct positive, a
probability of a false negative, a probability of a correct negative, a probability of
a false positive, a probability of correct classification, a probability of error, a
selectivity value and a balance value.

51. The method for periodic learning of claim 49 wherein a second set of classifier-
based performance metrics includes a prior probability of preseizure and a prior
probability of nonpreseizure.

52. The method for periodic learning of claim 49 wherein a third set of classifier- 
based performance metrics includes a false positives per hour measure, an average 
detection delay measure, an error risk measure and an overall risk measure. 

53. The method for periodic learning of claim 52 wherein the error risk measure is
determined by applying penalty factors to the probability of a false negative and
the probability of a false positive.

54. The method for periodic learning of claim 53 wherein the penalty factor for a
false negative representing a missed seizure detection is greater than the penalty
factor for a false positive representing a false alarm.

55. The method for periodic learning of claim 52 wherein the overall risk measure
penalizes all therapy activations.

56. The method for periodic learning of claim 55 wherein the overall risk measure is 
determined by applying penalty factors to the probability of a false negative, the 
probability of a false positive, the probability of a correct positive and the
probability of a correct negative. 

57. The method for periodic learning of claim 56 wherein the penalty factor
associated with the probability of a correct negative is set equal to zero. 

58. The method for periodic learning of claim 56 wherein the relative penalty factors 
are small for the probability of a correct positive, medium for a false positive and
very large for a false negative. 
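
The error risk and overall risk measures of claims 53 to 58 can be written as penalty-weighted sums of the four outcome probabilities. The sketch below assumes joint outcome probabilities that sum to one; the function names and the default penalty values are hypothetical and chosen only to respect the stated ordering (zero for a correct negative, small for a correct positive, medium for a false positive, very large for a false negative).

```python
def error_risk(p_fn, p_fp, r_fn=10.0, r_fp=2.0):
    """Penalize only the two kinds of error."""
    return r_fn * p_fn + r_fp * p_fp


def overall_risk(p_fn, p_fp, p_cp, p_cn, r_fn=10.0, r_fp=2.0, r_cp=1.0, r_cn=0.0):
    """Penalize errors and, via r_cp and r_fp, every therapy activation."""
    return r_fn * p_fn + r_fp * p_fp + r_cp * p_cp + r_cn * p_cn
```

Because a correct positive still triggers therapy, giving it a small nonzero penalty makes the overall risk prefer feature vectors that achieve the same detection performance with fewer activations.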

59. The method for periodic learning of claim 40 wherein the act of searching for an 
optimal feature vector further comprises using an optimality criterion to determine
an optimal decision rule. 

60. The method for periodic learning of claim 59 wherein the optimality criterion is 
any one of a maximum likelihood estimate, a minimum error estimate, a Neyman-
Pearson criterion, a minimum error risk estimate, a minimax error risk estimate,
and a minimum overall risk estimate.

61. The method for periodic learning of claim 60 wherein the Neyman-Pearson
criterion is applied inverted as a decision rule for seizure detectors in which the
false positive rate is minimized for a chosen constant false negative rate.

62. The method for periodic learning of claim 40 wherein the nonparametric 
classifiers that memorize the training data set to simulate therapy activation 
decision rules and accelerate feature optimization include k-nearest neighbors
(kNNs), probabilistic neural networks (PNNs) and hybrids.

63. The method for periodic learning of claim 62 wherein a plurality of distances in
kNNs and a plurality of kernel heights of PNNs are weighted with forgetting
factors according to the periods of time in which the training data is collected.

64. The method for periodic learning of claim 40 wherein the act of correcting a
discriminant function of the nonparametric classifier includes the act of applying
a correction factor to the discriminant function to correct for prior probability
mismatches between the a priori probability of seizure estimated from data and
the true probability of seizure.

65. The method for periodic learning of claim 29 wherein the act of searching for an
optimal feature vector includes the act of performing a forward sequential search
through the feature library.



66. The method for periodic learning of claim 65 wherein the forward sequential 
search includes the acts of: 

determining a score for each feature in the feature library;

selecting as a first feature, the feature that has the highest score and adding
it to the feature vector;

selecting as a second feature, the feature among the remaining features in
the feature library that works best in conjunction with the first
feature;

selecting the additional features required sequentially to complete the
feature vector by selecting among the remaining features, the
feature that works best in conjunction with all previously selected
features.
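
The forward sequential search recited in claim 66 can be sketched as a greedy loop. This is a minimal illustration, assuming a caller-supplied `score` function (hypothetical) that maps a tuple of feature names to a number where higher is better, for example one minus the overall risk on a validation set.

```python
from typing import Callable, List, Sequence, Tuple

Score = Callable[[Tuple[str, ...]], float]


def forward_sequential_search(features: Sequence[str], score: Score, k: int) -> List[str]:
    """Greedily grow a feature vector of up to k features."""
    selected: List[str] = []
    remaining = list(features)
    while remaining and len(selected) < k:
        # Pick the feature that works best together with those already chosen.
        best = max(remaining, key=lambda f: score(tuple(selected) + (f,)))
        selected.append(best)
        remaining.remove(best)
    return selected
```

The search evaluates on the order of k times the library size candidate vectors rather than the full power set, at the cost of never revisiting an earlier choice.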



67. The method for periodic learning of claim 29 wherein the act of searching for an
optimal feature vector includes the act of performing an add-on, knock-out search
through the feature library.

68. The method for periodic learning of claim 67 wherein the add-on, knock-out
search includes the acts of:

determining a score for each feature in the feature library;

performing a forward sequential search to select m candidate features one
at a time and adding each to the feature vector;

removing n worst candidate features from the feature vector, where n is
less than m;

repeating the performing and removing acts iteratively until k features are
selected for the feature vector.
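
The add-on, knock-out search of claim 68 can be sketched in the same style. The `score` function is again a hypothetical caller-supplied measure where higher is better. Two details are assumptions of this sketch rather than part of the claim: the "worst" feature is taken to be the one whose removal leaves the highest-scoring vector, and the final add-on phase is capped so that the vector ends with exactly k features.

```python
from typing import Callable, List, Sequence, Tuple

Score = Callable[[Tuple[str, ...]], float]


def add_on_knock_out(features: Sequence[str], score: Score, k: int, m: int, n: int) -> List[str]:
    """Alternate adding up to m features and removing n until k remain."""
    if k < 0 or not 0 <= n < m:
        raise ValueError("need k >= 0 and 0 <= n < m")
    if k + n > len(features):
        raise ValueError("need at least k + n features in the library")
    selected: List[str] = []
    remaining = list(features)
    while len(selected) != k:
        # Add-on phase: forward sequential search, capped at k + n features
        # so that the knock-out phase cannot undershoot the target size k.
        target = min(len(selected) + m, k + n)
        while len(selected) < target:
            best = max(remaining, key=lambda f: score(tuple(selected) + (f,)))
            selected.append(best)
            remaining.remove(best)
        # Knock-out phase: drop the n features whose removal hurts least.
        for _ in range(n):
            worst = max(selected, key=lambda f: score(tuple(x for x in selected if x != f)))
            selected.remove(worst)
            remaining.append(worst)
    return selected
```

Since n is less than m, each pass makes net progress toward k, and the knock-out phase gives the search a chance to undo an early choice that a plain forward search would keep.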

69. A computer readable medium containing a computer program product for
automatically predicting and preventing the electrographic onset of a seizure in an
individual, the computer program product comprising:

program instructions that monitor a plurality of signals indicative of the
activity of the brain of the individual;

program instructions that extract a set of features from the signals and
form an optimal feature vector;

program instructions that synthesize a probability vector based on the
optimal feature vector as an estimator of the likelihood of seizure
for a plurality of prediction time intervals; and

program instructions that prevent the electrographic onset of a seizure by
initiating the automatic application of at least one intervention
measure that is commensurate with the likelihood of seizure.

70. The computer program product for automatically predicting and preventing the 
electrographic onset of a seizure of claim 69 further comprising program 
instructions that display the probability vector of oncoming seizures for a plurality 
of prediction time intervals to indicate both the time frame when the
electrographic onsets are expected to occur and a degree of confidence in the 
predictions of electrographic onsets. 

71. The computer program product for automatically predicting and preventing the
electrographic onset of a seizure of claim 69 wherein the program instructions that
prevent the electrographic onset of seizures initiate the automatic application of
the minimally required intervention measure thereby minimizing associated side
effects.

72. The computer program product for automatically predicting and preventing the
electrographic onset of a seizure of claim 71 wherein the minimally required
intervention measure that is initiated couples minimally invasive benign therapies
to long prediction time intervals, aggressive therapies to short prediction time
intervals and gradually varying moderate forms of therapy to prediction time
intervals that are between the long and short prediction time intervals.

73. The computer program product for automatically predicting and preventing the 
electrographic onset of a seizure of claim 69 wherein the intervention measure
that is initiated includes at least one of cognitive stimulation, sensory stimulation,
biofeedback, electrical stimulation and pharmacological infusion.

74. The computer program product for automatically predicting and preventing the 
electrographic onset of a seizure of claim 69 further comprising program
instructions that apply a closed loop feedback control law and command a therapy
actuator in order to regulate the seizure probability vector. 

75. The computer program product for automatically predicting and preventing the 
electrographic onset of a seizure of claim 74 wherein the applied closed loop
feedback control law and therapy actuator that is applied includes any of a
proportional, proportional-integral-derivative (PID), optimal continuous, gain
scheduled, multilevel and bang-bang control strategies to regulate the seizure
probability vector as a controlled variable.



76. A computer readable medium containing a computer program product for
assessing a quality of life index in an individual subject to seizures in order to
adjust an implanted device to optimize patient-specific feature signals and
treatment therapies, the computer program product comprising:

program instructions that accumulate the energy of raw intracranial
electroencephalograms (IEEG) for the individual over multiple
data channels during seizures over a fixed time period;

program instructions that accumulate the energy of a treatment control
effort over the multiple data channels over all times of activation
of the implanted device over a fixed time period;

program instructions that weight the accumulated energy of the IEEG and
the accumulated energy of the control effort by seizure and
treatment factors to determine a quality for the fixed period of
time; and

program instructions that determine a quality of life index as a weighted
average of a current and previous qualities for a plurality of fixed
time periods.

77. The computer program product for assessing a quality of life index of claim 76
further comprising program instructions that determine the quality of life index as
an exponentially-weighted moving average of the daily quality which is based on
a forgetting schedule to taper off the influence of older data.

78. The computer program product for assessing the quality of life index of claim 77
wherein the program instructions that determine the quality of life index calculate
the index recursively using weights that are made to decay within four time
constants of the natural exponential function.



79. A computer readable medium containing a computer program product for periodic
learning to improve and maintain the performance of a device implanted in an
individual subject to seizures to provide treatment therapies, the computer
program product comprising:

program instructions that assess a quality of life index that penalizes the
intensity, duration and frequency of seizures and treatments over a
fixed period of time;

program instructions that collect a time of unequivocal electrographic
onset (UEO) in all recorded seizures over a previous fixed period
of time;

program instructions that create learning sets of data based on the UEOs
by clipping all the IEEG epochs immediately preceding seizures
and labeling the clipped epochs as preseizure raw data;

program instructions that clip and label randomly chosen, non-overlapping
data as non-preseizure or baseline raw data;

program instructions that generate a time series of all features in a feature
library from the preseizure and nonpreseizure raw data;



program instructions that search for an optimal feature vector in a power
set of the feature library to minimize a classifier-based
performance metric;

program instructions that synthesize a posterior probability estimator for
the optimal feature vector; and

program instructions that couple an optimal therapy activation threshold to
the probability estimator.

80. The computer program product for periodic learning of claim 79 further
comprising program instructions that join the preseizure and nonpreseizure raw
data sets with the corresponding data sets for the three preceding fixed periods of
time in order that the preseizure and nonpreseizure raw data sets have an 
incremental effect on learning. 

81. The computer program product for periodic learning of claim 79 further
comprising program instructions that preoptimize parameters associated with a 
time history of signal features and thresholds using statistical measures of linear 
separability. 

82. The computer program product for periodic learning of claim 81 further
comprising program instructions that generate a time series of each feature in the
feature library using a short calibration epoch of preseizure and nonpreseizure
data under a hypothesized set of parameters for the feature.

83. The computer program product for periodic learning of claim 69 further
comprising program instructions that determine a parameter set that maximizes 
separability by a greedy algorithm that varies one parameter at a time. 

84. The computer program product for periodic learning of claim 79 wherein the 
program instructions that search for an optimal feature vector further comprise:

program instructions that generate trajectories of candidate feature vectors 
by time synchronizing the time series of all features in the feature 
library; 

program instructions that synthesize nonparametric classifiers that
memorize a training data set to simulate therapy activation
decision rules and accelerate feature optimization;

program instructions that correct a discriminant function of the
nonparametric classifiers to determine the optimal decision rule;

program instructions that measure an overall risk over a validation set of 
data that is not directly used for synthesizing the nonparametric
classifiers; and 

program instructions that determine a next candidate feature vector using a
heuristic search criterion.

85. The computer program product for periodic learning of claim 84 further
comprising program instructions that compress a selected feature vector into a
single genetically-found, neurally-computed artificial feature with equal or better
discriminatory properties.

86. The computer program product for periodic learning of claim 84 further 
comprising program instructions that chart rank-order curves for overall risk for
the best n-vectors. 

87. The computer program product for periodic learning of claim 84 further
comprising program instructions that assess an expected performance for a next
month by calculating an average overall risk using cross-validation over all
available data. 

88. The computer program product for periodic learning of claim 84 further
comprising program instructions that assess an expected performance for a next
month by calculating an average overall risk using a single independent test set 
not used during the periodic learning. 



89. The computer program product for periodic learning of claim 84 wherein the
program instructions that synthesize a posterior probability estimator further
comprises:

program instructions that train a wavelet neural network to calculate
posterior probability estimators;

program instructions that minimize an expected value of a squared error
loss function on a validation set of data; and

program instructions that select a wavelet neural network that minimizes
the error over the validation set of data.



90. The computer program product for periodic learning of claim 89 further 

comprising program instructions that correct the posterior probability estimator's
bias term based on prior probability mismatches. 



91. The computer program product for periodic learning of claim 89 wherein the
wavelet neural network includes a logistic sigmoid output unit and program
instructions that calculate posterior probability estimators by presenting {0,1}
targets to indicate a nonseizure class and a seizure class, respectively.



92. The computer program product for periodic learning of claim 84 further
comprising program instructions that rank the desirability of feature vectors using
a plurality of sets of classifier-based performance metrics.

93. The computer program product for periodic learning of claim 92 wherein a first
set of classifier-based performance metrics includes a probability of a correct
positive, a probability of a false negative, a probability of a correct negative, a
probability of a false positive, a probability of correct classification, a probability
of error, a selectivity value and a balance value.

94. The computer program product for periodic learning of claim 92 wherein a second
set of classifier-based performance metrics includes a prior probability of 
preseizure and a prior probability of nonpreseizure. 

95. The computer program product for periodic learning of claim 92 wherein a third
set of classifier-based performance metrics includes a false positives per hour
measure, an average detection delay measure, an error risk measure and an overall 
risk measure. 

96. The computer program product for periodic learning of claim 92 further 
comprising program instructions that determine the error risk measure by 
applying penalty factors to the probability of a false negative and the probability 
of a false positive.

97. The computer program product for periodic learning of claim 84 wherein the
program instructions that search for an optimal feature vector further comprises
program instructions that use an optimality criterion to determine an optimal
decision rule.

98. The computer program product for periodic learning of claim 97 wherein the
optimality criterion is any one of a maximum likelihood estimate, a minimum
error estimate, a Neyman-Pearson criterion, a minimum error risk estimate, a
minimax error risk estimate, and a minimum overall risk estimate.

99. The computer program product for periodic learning of claim 98 wherein the
Neyman-Pearson criterion is applied inverted as a decision rule for seizure 
detectors in which the false positive rate is minimized for a chosen constant false 
negative rate. 

100. The computer program product for periodic learning of claim 84 wherein the
nonparametric classifiers that memorize the training data set to simulate therapy
activation decision rules and accelerate feature optimization include k-nearest
neighbors (kNNs), probabilistic neural networks (PNNs) and hybrids. 

101. The computer program product for periodic learning of claim 100 further
comprising program instructions that weight a plurality of distances in kNNs and
a plurality of kernel heights of PNNs with forgetting factors according to the
periods of time in which the training data is collected.

102. The computer program product for periodic learning of claim 84 wherein the
program instructions that correct a discriminant function of the nonparametric
classifier includes program instructions that apply a correction factor to the
discriminant function to correct for prior probability mismatches between the a
priori probability of seizure estimated from data and the true probability of
seizure.

103. The computer program product for periodic learning of claim 84 wherein program
instructions that search for an optimal feature vector includes program 
instructions that perform a forward sequential search through the feature library. 

104. The computer program product for periodic learning of claim 103 wherein the 
program instructions that perform a forward sequential search includes: 

program instructions that determine a score for each feature in the feature 
library; 

program instructions that select as a first feature, the feature that has the
highest score and adding it to the feature vector;

program instructions that select as a second feature, the feature among the
remaining features in the feature library that works best in
conjunction with the first feature; and

program instructions that select the additional features required
sequentially to complete the feature vector by selecting among the
remaining features, the feature that works best in conjunction with
all previously selected features.



105. The computer program product for periodic learning of claim 79 wherein the
program instructions that search for an optimal feature vector includes program 
instructions that perform an add-on, knock-out search through the feature library. 



106. The computer program product for periodic learning of claim 105 wherein the 
program instructions that perform the add-on, knock-out search includes: 

program instructions that determine a score for each feature in the feature 
library; 

program instructions that perform a forward sequential search to select m
candidate features one at a time and add each to the feature vector;

program instructions that remove n worst candidate features from the
feature vector, where n is less than m; and

program instructions that repeat the performing and removing program
instructions iteratively until k features are selected for the feature
vector.

107. A system for automatically predicting and preventing the electrographic onset of a
seizure in an individual, comprising:

a signal acquisition component to condition and digitize a plurality of raw 
signals received from a transducer implanted in the individual; 

a preprocessor to attenuate any artifacts in the plurality of digitized 
signals; 

a feature extraction component containing processing logic to select
patient-specific seizure-predictive and seizure-indicative attributes
from the preprocessed signals to form an optimal feature vector;

a probability estimator component that synthesizes a probability vector as 
an estimator of the likelihood of seizure for a plurality of 
prediction times; 

a multitherapy activation component containing processing logic to 
determine the therapy modalities that are to be activated or
deactivated at any time; and 

an implanted device including a plurality of therapy actuators to 

automatically activate at least one associated therapy in response to 
an output signal from the multitherapy activation component. 

108. The system for automatically predicting and preventing the electrographic onset
of a seizure of claim 107 wherein the therapies associated with the plurality of
therapy actuators include one or more of cognitive stimulation, sensory
stimulation, biofeedback, electrical stimulation and pharmacological infusion.

109. The system for automatically predicting and controlling the electrographic onset 
of a seizure of claim 107 wherein the preprocessor is implemented in any one or 
more of a microprocessor, a digital signal processor, a field programmable gate 
array, an application specific integrated circuit, and a hybrid analog/digital circuit.

110. The system for automatically predicting and controlling the electrographic onset 
of a seizure of claim 107 wherein the feature extraction component is 
implemented in any one or more of a microprocessor, a digital signal processor, a 
field programmable gate array, an application specific integrated circuit, and a 
hybrid analog/digital circuit. 

111. The system for automatically predicting and controlling the electrographic onset 
of a seizure of claim 107 wherein the probability estimator component is 
implemented in any one or more of a microprocessor, a digital signal processor, a 
field programmable gate array, an application specific integrated circuit, and a 
hybrid analog/digital circuit. 

112. The system for automatically predicting and controlling the electrographic onset
of a seizure of claim 107 wherein the multitherapy activation component is
implemented in any one or more of a microprocessor, a digital signal processor, a
field programmable gate array, an application specific integrated circuit, and a
hybrid analog/digital circuit.

113. The system for automatically predicting and controlling the electrographic onset 
of a seizure of claim 107 wherein the plurality of therapy actuators are 
programmed to operate in an open loop mode, delivering prophylactic treatment 
by any one of a button, a magnet, and a vibration transducer.

114. The system for automatically predicting and controlling the electrographic onset
of a seizure of claim 107 wherein the plurality of therapy actuators are 
programmed to operate in a closed loop mode, using any one of a proportional, a 
proportional-integral-derivative, an optimal continuous, a gain-scheduled, a 
multilevel, and a bang-bang feedback control strategy. 

115. The system for automatically predicting and controlling the electrographic onset
of a seizure of claim 107 further comprising an external storage for storing the
digitized signals processed by the signal acquisition component.

116. The system for automatically predicting and controlling the electrographic onset 
of a seizure of claim 107 further comprising a wearable access device to display
the probability vector of oncoming seizures and the multitherapy activation status
to the individual.

117. The system for automatically predicting and controlling the electrographic onset 
of a seizure of claim 107 wherein the implanted device further includes an 
electronically erasable programmable read-only memory for downloading a 
plurality of learned algorithms and parameters received via transceivers. 

118. The system for automatically predicting and controlling the electrographic onset 
of a seizure of claim 107, further comprising a processor including a centralized 
learning and programming interface logic for simulating the actions of the feature 
extraction component, probability estimator component, and multitherapy 
activation component in order to improve the performance of each component. 

119. The system for automatically predicting and controlling the electrographic onset 
of a seizure of claim 118 wherein the processor is a laptop or workstation 
computer. 

120. The system for automatically predicting and controlling the electrographic onset 
of a seizure of claim 119 wherein the laptop or workstation computer is at a 
location different from that of the individual. 

121. The system for automatically predicting and controlling the electrographic onset 
of a seizure of claim 118 wherein the processor is located at a remote facility that 
is accessible over the Internet via a T1 line or other high speed digital link. 

122. The system for automatically predicting and controlling the electrographic onset 
of a seizure of claim 107 wherein the plurality of therapy actuators automatically 
apply the minimally required intervention measure, thereby minimizing 
associated side effects. 

123. The system for automatically predicting and controlling the electrographic onset 
of a seizure of claim 122 wherein the minimally required intervention measure 
starts with a benign form of therapy and continues with more aggressive therapies 
as the probability vector continuously changes with decreasing time to 
electrographic onset. 

124. The system for automatically predicting and controlling the electrographic onset 
of a seizure of claim 122 wherein the minimally required intervention measure 
couples minimally invasive benign therapies to long prediction time intervals, 
aggressive therapies to short prediction time intervals and gradually varying 
moderate forms of therapy to prediction time intervals that are between the long 
and short prediction time intervals. 
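
By way of illustration only, the coupling of therapy aggressiveness to prediction time interval recited in claims 123 and 124 might be sketched as below. The function name `select_therapy`, the tier labels, and the per-horizon thresholds are hypothetical; the claims do not fix the number of horizons or any threshold values.

```python
def select_therapy(probabilities, thresholds,
                   tiers=("benign", "moderate", "aggressive")):
    """Pick a therapy tier from a probability vector.

    probabilities[i] is the estimated probability of a seizure within
    prediction horizon i, with horizons ordered from longest to shortest.
    thresholds[i] is the (assumed) activation threshold for that horizon.
    Returns None when no horizon warrants intervention.
    """
    chosen = None
    for p, threshold, tier in zip(probabilities, thresholds, tiers):
        if p >= threshold:
            # Later entries correspond to shorter horizons, so they
            # override earlier ones with a more aggressive tier.
            chosen = tier
    return chosen
```

With equal thresholds of 0.5, the vector `[0.8, 0.2, 0.1]` selects the benign tier, while `[0.9, 0.7, 0.6]` escalates to the aggressive tier.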

125. A system for assessing a quality of life index in an individual subject to seizures 
in order to adjust an implanted device in order to optimize patient-specific feature 
signals and treatment therapies comprising: 

a signal acquisition component to condition and digitize a plurality of raw 
signals received on multiple data channels from a transducer 
implanted in an individual; 
a first storage for accumulating the energy of raw intracranial 
electroencephalograms (IEEG) for the individual on multiple data 
channels during seizures over a fixed time period; 
a second storage for accumulating the energy of a treatment control effort 
on the multiple data channels over all times of activation of the 
implanted device over a fixed time period; 
a processor including: 

a first logic module for weighting the accumulated energy 
of the IEEG and the accumulated energy of the 
control effort by seizure and treatment factors to 
determine a quality for the fixed period of time; and 
a second logic module for determining a quality of life 
index as a weighted average of a current and 
previous qualities for a plurality of fixed time 
periods. 

126. The system for assessing the quality of life index of claim 125 wherein the second 
logic module for determining the quality of life index uses an exponentially-weighted 
moving average of the daily quality which is based on a forgetting 
schedule to taper off the influence of older data. 
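
By way of illustration only, the two logic modules of claims 125 and 126 might be sketched as below. The function names, the sign convention (a more negative quality denotes a worse period), the default weighting factors, and the forgetting factor of 0.9 are assumptions made for this sketch; the claims state only that the energies are weighted by seizure and treatment factors and that older data is tapered off.

```python
def daily_quality(seizure_energy, treatment_energy,
                  seizure_factor=1.0, treatment_factor=0.1):
    """Quality for one fixed period as a weighted penalty (assumed form).

    seizure_energy: accumulated IEEG energy during seizures in the period.
    treatment_energy: accumulated energy of the treatment control effort.
    """
    return -(seizure_factor * seizure_energy
             + treatment_factor * treatment_energy)


def quality_of_life_index(daily_qualities, forgetting=0.9):
    """Exponentially weighted moving average of the daily quality.

    Each new day contributes (1 - forgetting); the influence of a day
    that is k days old decays as forgetting ** k.
    """
    index = None
    for quality in daily_qualities:
        if index is None:
            index = quality          # initialize with the first period
        else:
            index = forgetting * index + (1.0 - forgetting) * quality
    return index
```

A forgetting factor closer to 1 gives a longer memory, while a smaller value makes the index track recent periods more closely.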

127. A system for periodic learning to improve and maintain the performance of a 
device implanted in an individual subject to seizures in providing treatment 
therapies comprising: 

a signal acquisition component to condition and digitize a plurality of raw 
signals received from a transducer implanted in the individual; 
a processor coupled to the signal acquisition component and including a 
learning and training module for 

assessing a quality of life index that penalizes the intensity, 
duration and frequency of seizures and treatments 
over a fixed period of time; 
marking a time of unequivocal electrographic onset (UEO) 
in all recorded seizures over a previous fixed period 
of time; 
creating learning sets of data based on the UEOs by 
clipping all the IEEG epochs immediately preceding 
seizures and labeling the clipped epochs as 
preseizure raw data; 
clipping and labeling randomly chosen, non-overlapping 
data as non-preseizure or baseline raw data; 
generating a time series of all features in a feature library 
from the preseizure and nonpreseizure raw data; 
searching for an optimal feature vector in a power set of the 
feature library to minimize a classifier-based 
performance metric; 
synthesizing a posterior probability estimator for the 
optimal feature vector; and 
coupling an optimal therapy activation threshold to the 
probability estimator. 
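
By way of illustration only, the creation of learning sets from marked UEOs in claim 127 might be sketched as below. The function names, the representation of a recording as sample indices, the fixed epoch length, and the exclusion of the seizure intervals themselves from the baseline are assumptions made for this sketch.

```python
import random


def overlaps(a, b):
    """True if the half-open sample ranges a=(start, stop) and b overlap."""
    return a[0] < b[1] and b[0] < a[1]


def clip_learning_sets(n_samples, seizures, epoch_len, n_baseline, seed=0):
    """Return (preseizure, baseline) lists of (start, stop) sample ranges.

    seizures: list of (ueo, end) sample indices, one pair per seizure.
    """
    # Preseizure epochs: the epoch_len samples immediately preceding each UEO.
    preseizure = [(ueo - epoch_len, ueo)
                  for ueo, _ in seizures if ueo >= epoch_len]
    excluded = preseizure + list(seizures)
    # Candidate baseline windows lie on a fixed grid, so no two of them
    # can overlap each other.
    grid = [(s, s + epoch_len)
            for s in range(0, n_samples - epoch_len + 1, epoch_len)]
    free = [w for w in grid if not any(overlaps(w, e) for e in excluded)]
    # Randomly choose baseline epochs among the remaining windows.
    rng = random.Random(seed)
    baseline = sorted(rng.sample(free, min(n_baseline, len(free))))
    return preseizure, baseline
```

The returned index ranges would then be used to slice the raw IEEG channels before computing the feature time series.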



128. The system for periodic learning of claim 127 wherein the learning and training 
module further comprises programming logic for: 

generating trajectories of candidate feature vectors by time synchronizing 
the time series of all features in the feature library; 
synthesizing nonparametric classifiers that memorize a training data set to 
simulate therapy activation decision rules and accelerate feature 
optimization; 
correcting a discriminant function of the nonparametric classifiers to 
determine the optimal decision rule; 
measuring an overall risk over a validation set of data that is not directly 
used for synthesizing the nonparametric classifiers; and 
determining a next candidate feature vector using a heuristic search 
criterion. 
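
By way of illustration only, the search loop of claim 128 might be sketched with a nearest-neighbor classifier standing in for the nonparametric classifier and greedy forward selection standing in for the heuristic search criterion. Both substitutions, together with all function names and the penalty defaults, are assumptions of this sketch; the step of correcting the discriminant function is not shown.

```python
def nearest_neighbor_predict(train_x, train_y, x, features):
    """Label of the memorized training point closest to x on `features`."""
    def distance(i):
        return sum((train_x[i][f] - x[f]) ** 2 for f in features)
    return train_y[min(range(len(train_x)), key=distance)]


def validation_risk(train, valid, features, fn_penalty=1.0, fp_penalty=1.0):
    """Average penalty over a validation set not used to build the classifier.

    Labels: 1 = preseizure, 0 = nonpreseizure.
    """
    (train_x, train_y), (valid_x, valid_y) = train, valid
    risk = 0.0
    for x, y in zip(valid_x, valid_y):
        if nearest_neighbor_predict(train_x, train_y, x, features) != y:
            risk += fn_penalty if y == 1 else fp_penalty
    return risk / len(valid_y)


def forward_search(train, valid, n_features, max_size):
    """Greedily grow the feature vector while the validation risk improves."""
    selected, best_risk = [], float("inf")
    while len(selected) < max_size:
        candidates = [f for f in range(n_features) if f not in selected]
        if not candidates:
            break
        risk, feature = min(
            (validation_risk(train, valid, selected + [f]), f)
            for f in candidates)
        if risk >= best_risk:
            break                      # no candidate lowers the risk
        selected.append(feature)
        best_risk = risk
    return selected, best_risk
```

Because the classifier only memorizes the training set, each candidate feature vector can be scored without a training phase, which is what makes the search inexpensive.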

129. The system for periodic learning of claim 127 wherein the learning and training 
module further comprises programming logic for: 

training a wavelet neural network to calculate posterior probability 
estimators; 
minimizing an expected value of a squared error loss function on a 
validation set of data; and 
selecting a wavelet neural network that minimizes the error over the 
validation set of data. 

130. The system for periodic learning of claim 127 wherein the learning and training 
module further comprises programming logic for ranking a desirability of feature 
vectors using a plurality of sets of classifier-based performance metrics. 

131. The system for periodic learning of claim 130 wherein a first set of classifier- 
based performance metrics includes a probability of a correct positive, a 
probability of a false negative, a probability of a correct negative, a probability of 
a false positive, a probability of correct classification, a probability of error, a 
selectivity value, and a balance value. 

132. The system for periodic learning of claim 130 wherein a second set of classifier- 
based performance metrics includes a prior probability of preseizure and a prior 
probability of nonpreseizure. 

133. The system for periodic learning of claim 130 wherein a third set of classifier- 
based performance metrics includes a false positives per hour measure, an average 
detection delay measure, an error risk measure, and an overall risk measure. 

134. The system for periodic learning of claim 133 wherein the overall risk measure is 
determined by applying penalty factors to the probability of a false negative, the 
probability of a false positive, the probability of a correct positive, and the 
probability of a correct negative. 
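
By way of illustration only, several of the metrics of claims 131 and 132 and the overall risk of claim 134 might be computed from confusion counts as below. The function names, the treatment of the four outcome probabilities as class-conditional rates, and the weighting of each class by its prior in the risk sum are assumptions of this sketch; the selectivity and balance values are omitted because their definitions are not given in the claims. Both classes are assumed to be present in the data.

```python
def classifier_metrics(tp, fn, tn, fp):
    """Probabilities estimated from counts of the four classifier outcomes.

    tp/fn: preseizure epochs classified correctly / missed.
    tn/fp: nonpreseizure epochs classified correctly / falsely flagged.
    """
    pos, neg = tp + fn, tn + fp
    total = pos + neg
    return {
        "p_cp": tp / pos,              # correct positive
        "p_fn": fn / pos,              # false negative
        "p_cn": tn / neg,              # correct negative
        "p_fp": fp / neg,              # false positive
        "p_cc": (tp + tn) / total,     # correct classification
        "p_e": (fn + fp) / total,      # error
        "prior_pre": pos / total,      # prior probability of preseizure
        "prior_non": neg / total,      # prior probability of nonpreseizure
    }


def overall_risk(m, pen_fn, pen_fp, pen_cp=0.0, pen_cn=0.0):
    """Expected penalty: each outcome probability times its penalty factor."""
    pre = m["prior_pre"] * (m["p_fn"] * pen_fn + m["p_cp"] * pen_cp)
    non = m["prior_non"] * (m["p_fp"] * pen_fp + m["p_cn"] * pen_cn)
    return pre + non
```

Penalizing false negatives more heavily than false positives, as in `overall_risk(m, pen_fn=10.0, pen_fp=1.0)`, biases the ranking toward feature vectors that rarely miss a seizure.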

135. The system for periodic learning of claim 128 wherein the learning and training 
module further comprises programming logic for determining an optimal decision 
rule based on a selected optimality criterion. 

136. The system for periodic learning of claim 135 wherein the optimality criterion is 
any one of a maximum likelihood estimate, a minimum error estimate, a Neyman-Pearson 
criterion, a minimum error risk estimate, a minimax risk estimate, and a 
minimum overall risk estimate. 